Probabilistic logic programs are logic programs in which some of the facts are annotated
with probabilities. This paper investigates how classical inference and learning tasks known
from the graphical model community can be tackled for probabilistic logic programs.
Several such tasks such as computing the marginals given evidence and learning from
(partial) interpretations have not really been addressed for probabilistic logic programs
before.

The first contribution of this paper is a suite of efficient algorithms for various inference
tasks. It is based on a conversion of the program and the queries and evidence to a weighted
Boolean formula. This allows us to reduce the inference tasks to well-studied tasks such
as weighted model counting, which can be solved using state-of-the-art methods known
from the graphical model and knowledge compilation literature. The second contribution
is an algorithm for parameter estimation in the learning from interpretations setting.
The algorithm employs Expectation Maximization, and is built on top of the developed
inference algorithms.

The proposed approach is experimentally evaluated. The results show that the inference
algorithms improve upon the state-of-the-art in probabilistic logic programming and
that it is indeed possible to learn the parameters of a probabilistic logic program from
interpretations.

KEYWORDS: Probabilistic logic programming, Probabilistic inference, Parameter learning



1 Introduction 

There is a lot of interest in combining probability and logic for dealing with complex
relational domains. This interest has resulted in the fields of Probabilistic Logic
Programming (PLP) (De Raedt et al. 2008) and Statistical Relational Learning
(SRL) (Getoor and Taskar 2007). While the two fields essentially study the same
problem, there are differences in emphasis. SRL techniques have focussed on the
extension of probabilistic graphical models like Markov or Bayesian networks with
logical and relational representations, as in for instance Markov logic (Poon and
Domingos 2006). Conversely, PLP has extended logic programming languages (or
Prolog) with probabilities. This has resulted in differences in representation and
semantics between the two approaches and, more importantly, also in differences in
the inference tasks and learning settings that are supported. In graphical models
and SRL, the most common inference tasks are that of computing the marginal
probability of a set of random variables given some evidence (we call this the MARG
task) and finding the most likely joint state of the random variables given the
evidence (the MPE task). The PLP community has mostly focussed on computing
the success probability of queries without evidence. Furthermore, probabilistic logic
programs are usually learned from entailment (Sato and Kameya 2008; Gutmann
et al. 2008a), while the standard learning setting in graphical models and SRL
corresponds to learning from interpretations. This paper bridges the gap between
the two communities, by adapting the traditional graphical model and SRL settings
towards the PLP perspective. We contribute general MARG and MPE inference
techniques and a learning from interpretations algorithm for PLP. In this paper
we use ProbLog (De Raedt et al. 2007) as the PLP language, but our approach
is relevant to related languages like ICL (Poole 2008), PRISM (Sato and Kameya
2008) and LPAD/CP-logic (Vennekens et al. 2009) as well.



The first key contribution of this paper is a two-step approach for performing
MARG and MPE inference in probabilistic logic programs. In the first step, the
program is converted to an equivalent weighted Boolean (propositional) formula.
This conversion is based on well-known conversions from the knowledge representation
and logic programming literature. The MARG task then reduces to weighted
model counting (WMC) on the resulting weighted formula, and the MPE task to
weighted MAX-SAT. The second step then involves calling a state-of-the-art solver
for WMC or MAX-SAT. In this way, we establish new links between PLP inference
and standard problems such as WMC and MAX-SAT. We also identify a novel
connection between PLP and Markov Logic (Poon and Domingos 2006). From a
probabilistic perspective, our approach is similar to the work of Darwiche (2009)
and others (Sang et al. 2005; Park 2002), who perform Bayesian network inference
by conversion to weighted formulas. We do the same for PLP, a much more expressive
representation framework than traditional graphical models. PLP extends a
programming language and allows us to concisely represent large sets of dependencies
between random variables. From a logical perspective, our approach is related
to Answer Set Programming (ASP), where models are often computed by translating
the ASP program to a Boolean formula and applying a SAT solver (Lin and
Zhao 2002). Our approach is similar in spirit, but is different in that it employs a
probabilistic context.

The second key contribution of this paper is an algorithm for learning the parameters
of probabilistic logic programs from data. We use the learning from interpretations
(LFI) setting, which is the standard setting in graphical models and
SRL (although they use different terminology). This setting has also received a lot
of attention in inductive logic programming (De Raedt 2008), but has not yet been
used for probabilistic logic programs. Our algorithm, called LFI-ProbLog, is based
on Expectation-Maximization (EM) and is built on top of the inference techniques
presented in this paper.



The present paper is based on and integrates our previous papers (Fierens et al.
2011; Gutmann et al. 2011) in which inference and learning were studied and implemented
separately. Historically, the learning from interpretations approach as
detailed by Gutmann et al. (2011) and Gutmann et al. (2010) was developed first
and used BDDs for inference and learning. The use of BDDs for learning in an
EM-style is related to the approach of Ishihata et al. (2008), who developed an EM
algorithm for propositional BDDs and suggested that their approach can be used to
perform learning from entailment for PRISM programs. Fierens et al. (2011) later
showed that an alternative approach to inference - that is more general, efficient
and principled - can be realized using weighted model counting and compilation
to d-DNNFs rather than BDDs as in the initial ProbLog implementation (Kimmig
et al. 2010). The present paper employs the approach by Fierens et al. also
for learning from interpretations in an EM-style and thus integrates the two earlier
approaches. The resulting techniques are integrated in a novel implementation,
called ProbLog2. While the first ProbLog implementation (Kimmig et al. 2010)
was tightly integrated in the YAP Prolog engine and employed BDDs, ProbLog2
is much closer in spirit to some Answer Set Programming systems than to Prolog
and it employs d-DNNFs and weighted model counting.

This paper is organized as follows. We first review the necessary background
(Section 2) and introduce PLP (Section 3). Next we state the inference tasks that
we consider (Section 4). Then we introduce our two-step approach for inference
(Sections 5 and 6), and introduce the new learning algorithm (Section 7). Finally
we briefly discuss the implementation of the new system (Section 8) and evaluate
the entire approach by means of experiments on relational data (Section 9).

2 Background 

We now review the basics of first-order logic (FOL) and logic programming (LP). 
Readers familiar with FOL and LP can safely skip this section. 

2.1 First-Order Logic (FOL)

A term is a variable, a constant, or a functor applied to terms. An atom is of the
form $p(t_1, \ldots, t_n)$ where p is a predicate of arity n and the $t_i$ are terms. A formula
is built out of atoms using universal and existential quantifiers and the usual logical
connectives ¬, ∨, ∧, → and ↔. A FOL theory is a set of formulas that implicitly
form a conjunction. An expression is called ground if it does not contain variables.
A ground (or propositional) theory is said to be in conjunctive normal form (CNF)
if it is a conjunction of disjunctions of literals. A literal is an atom or its negation.
Each disjunction of literals is called a clause. A disjunction consisting of a single
literal is called a unit clause. Each ground theory can be written in CNF form.

The Herbrand base of a FOL theory is the set of all ground atoms constructed
using the predicates, functors and constants in the theory. A Herbrand interpretation,
also called a (possible) world, is an assignment of a truth value to all atoms
in the Herbrand base. A world or interpretation is called a model of the theory if it
satisfies all formulas in the theory (in other words, if all formulas evaluate to true
in that world).



2.2 Logic Programming (LP) 

Syntactically, a normal logic program, or briefly logic program (LP), is a set of
rules. A rule (also called a normal clause) is a universally quantified expression of
the form h :- b1, ..., bn, where h is an atom and $b_1, \ldots, b_n$ are literals. The
atom h is called the head of the rule and $b_1, \ldots, b_n$ the body, representing the
conjunction $b_1 \wedge \ldots \wedge b_n$. A fact is a rule that has true as its body and is written
more compactly as h.



We use the well-founded semantics for LPs (Van Gelder et al. 1991). In the case
of a negation-free LP (or definite program), the well-founded model is identical to
the well-known Least Herbrand Model (LHM). The LHM is equal to the least of all
models obtained when interpreting the LP as a FOL theory of implications. The
least model is the model that is a subset of all other models (in the sense that it
makes the fewest atoms true). Intuitively, the LHM is the set of all ground atoms
that are entailed by the LP. For negation-free LPs, the LHM is guaranteed to exist
and be unique. For LPs with negation, we use the well-founded model. We refer
to Van Gelder et al. (1991) for details. The ProbLog semantics requires all considered
logic programs to have a two-valued well-founded model (see Section 3.2). For
such programs, the well-founded model is identical to the stable model (Van Gelder
et al. 1991).

Intuitively, the reason why one considers only the least model of an LP is that LP
semantics makes the closed world assumption (CWA). Under the CWA, everything
that is not implied to be true is assumed to be false. This has implications on how
to interpret rules. Given a ground LP and an atom a, the set of all rules with a
in the head should be read as the definition of a: the atom a is defined to be true
if and only if at least one of the rule bodies is true (the 'only if' is due to the
CWA). This means that there is a crucial difference in semantics between LP and
FOL since FOL does not make the CWA. For example, the FOL theory {a ← b}
has 3 models {¬a, ¬b}, {a, ¬b} and {a, b}. The LP {a :- b} has only one model,
namely the least Herbrand model {¬a, ¬b} (intuitively, a and b are false because
there is no rule that makes b true, and hence there is no applicable rule that makes
a true either).
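
The contrast between the two readings of the rule can be checked mechanically. The following is a minimal Python sketch, not part of ProbLog: the helper names `fol_models` and `least_herbrand_model` are hypothetical, and the sketch assumes a ground, negation-free program given as (head, body) pairs.

```python
from itertools import product

# The ground program {a :- b}, written as (head, [positive body atoms]).
rules = [("a", ["b"])]
atoms = ["a", "b"]

def fol_models(rules, atoms):
    """All truth assignments that satisfy every rule read as the implication body -> head."""
    models = []
    for values in product([False, True], repeat=len(atoms)):
        world = dict(zip(atoms, values))
        if all(world[h] or not all(world[b] for b in body) for h, body in rules):
            models.append({x for x in atoms if world[x]})
    return models

def least_herbrand_model(rules):
    """Least fixpoint: an atom becomes true only if some rule body forces it."""
    true_atoms = set()
    changed = True
    while changed:
        changed = False
        for h, body in rules:
            if h not in true_atoms and all(b in true_atoms for b in body):
                true_atoms.add(h)
                changed = True
    return true_atoms

print(fol_models(rules, atoms))      # three models, listed by their true atoms
print(least_herbrand_model(rules))   # no atom is true under the CWA
```

Each model is represented by its set of true atoms, so the empty set stands for {¬a, ¬b}.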

Because of the syntactic restrictions of LP, it is tempting to believe that FOL
is more 'expressive' than LP. This is wrong because of the difference in semantics:
certain concepts that can be expressed in LP cannot be expressed in FOL (see
Section 3.3 for details). This motivates our interest in LP and PLP.

3 Probabilistic Logic Programming and ProbLog



Most probabilistic logic programming languages, including PRISM (Sato and Kameya
2008), ICL (Poole 2008), ProbLog (De Raedt et al. 2007) and LPAD (Vennekens
et al. 2009), are based on Sato's distribution semantics (Sato 1995). In this paper
we use ProbLog, but our approach can be used for the other languages as well.



3.1 Syntax of ProbLog

A ProbLog program consists of two parts: a set of ground probabilistic facts, and
a logic program, i.e. a set of rules and ('non-probabilistic') facts. A ground probabilistic
fact, written p::f, is a ground fact f annotated with a probability p. We
allow syntactic sugar for compactly specifying an entire set of probabilistic facts
with a single statement. Concretely, we allow what we call intensional probabilistic
facts, which are statements of the form p::f(X1, X2, ..., Xn) :- body, with body
a conjunction of calls to non-probabilistic facts.¹ The idea is that such a statement
defines the domains of the variables X1, X2, ... and Xn. When defining the
semantics, as well as when performing inference or learning, an intensional probabilistic
fact should be replaced by its corresponding set of ground probabilistic
facts, as illustrated below. An atom that unifies with a ground probabilistic fact is
called a probabilistic atom, while an atom that unifies with the head of some rule
in the logic program is called a derived atom. The set of probabilistic atoms must
be disjoint from the set of derived atoms. Also, the rules in the program should be
range-restricted: all variables in the head of a rule should also appear in a positive
literal in the body of the rule.

Our running example is the program that models the well-known 'Alarm' Bayesian 
network. 

Example 1 (Running Example)

0.1::burglary.            person(mary).
0.2::earthquake.          person(john).
0.7::hears_alarm(X) :- person(X).
alarm :- burglary.
alarm :- earthquake.
calls(X) :- alarm, hears_alarm(X).

This ProbLog program consists of probabilistic facts and a logic program. Predicates
of probabilistic atoms are burglary/0, earthquake/0 and hears_alarm/1,
predicates of derived atoms are person/1, alarm/0 and calls/1. Intuitively, the
probabilistic facts 0.1::burglary and 0.2::earthquake state that there is a burglary
with probability 0.1 and an earthquake with probability 0.2. The statement
0.7::hears_alarm(X) :- person(X) is an intensional probabilistic fact and is
syntactic sugar for the following set of ground probabilistic facts.

0.7::hears_alarm(mary).
0.7::hears_alarm(john).

1 The notion of intensional probabilistic facts does not appear in earlier ProbLog papers but is
often useful in practice.

The rules in the program define when the alarm goes off and when a person calls, 
as a function of the probabilistic facts. 

3.2 Semantics of ProbLog 

A ProbLog program specifies a probability distribution over possible worlds. To
define this distribution, it is easiest to consider the grounding of the program with
respect to the Herbrand base.² In this paper, we assume that the resulting Herbrand
base is finite. For the distribution semantics in the infinite case, see Sato (1995).

Each ground probabilistic fact p::f gives an atomic choice, i.e. we can choose
to include f as a fact (with probability p) or discard it (with probability 1 − p). A
total choice is obtained by making an atomic choice for each ground probabilistic
fact. Formally, a total choice is any subset of the set of all ground probabilistic
atoms. Hence, if there are n ground probabilistic atoms then there are $2^n$ total
choices. Moreover, we have a probability distribution over these total choices: the
probability of a total choice is defined to be the product of the probabilities of the
atomic choices that it is composed of (we can take the product since atomic choices
are seen as independent events).

Example 2 (Total Choices of the Alarm Example)

Consider the Alarm program of Example 1. The $2^4 = 16$ total choices corresponding
to the 4 ground probabilistic atoms are given in Table 1. The first row corresponds
to the total choice in which all the probabilistic atoms are true. The probability of
this total choice is 0.1 × 0.2 × 0.7 × 0.7 = 0.0098. The second row corresponds to
the same total choice except that hears_alarm(mary) is now false. The probability
is 0.1 × 0.2 × 0.7 × (1 − 0.7) = 0.0042. The sum of probabilities of all 16 total choices
is equal to one.
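
The enumeration in this example is small enough to reproduce directly. The sketch below is an illustration in plain Python rather than ProbLog code; the dictionary `prob_facts` and the function `total_choices` are hypothetical names, and the ground probabilistic facts are entered by hand.

```python
from itertools import product

# Ground probabilistic facts of the Alarm program (entered by hand).
prob_facts = {
    "burglary": 0.1,
    "earthquake": 0.2,
    "hears_alarm(john)": 0.7,
    "hears_alarm(mary)": 0.7,
}

def total_choices(prob_facts):
    """Yield (set of included facts, probability) for each of the 2^n total choices."""
    names = list(prob_facts)
    for included in product([True, False], repeat=len(names)):
        chosen = frozenset(n for n, inc in zip(names, included) if inc)
        p = 1.0
        for n, inc in zip(names, included):
            # Atomic choices are independent, so their probabilities multiply.
            p *= prob_facts[n] if inc else 1.0 - prob_facts[n]
        yield chosen, p

choices = list(total_choices(prob_facts))
for i, (chosen, p) in enumerate(choices, start=1):
    print(i, sorted(chosen), round(p, 4))
print("sum:", round(sum(p for _, p in choices), 10))
```

With the facts listed in this order, the enumeration order happens to coincide with the row order used for the 16 total choices.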

Given a particular total choice C, we obtain a logic program C ∪ R, where R
denotes the rules in the ProbLog program. We denote the well-founded model of this
logic program as WFM(C ∪ R).³ We call a given world ω a model of the ProbLog
program if there indeed exists a total choice C such that WFM(C ∪ R) = ω. We
use MOD(L) to denote the set of all models of a ProbLog program L. The ProbLog
semantics is only well-defined for programs that are sound (Riguzzi and Swift 2013),
i.e., programs for which each possible total choice C leads to a well-founded model
that is two-valued or 'total' (Riguzzi and Swift 2013; Van Gelder et al. 1991).⁴
Programs for which this is not the case are not considered valid ProbLog programs.
Everything is now in place to define the distribution over possible worlds: the
probability of a world that is a model of the ProbLog program is equal to the
probability of its total choice; the probability of a world that is not a model is 0.

2 Beforehand, a preprocessing step already replaced the intensional probabilistic facts with their
corresponding ground set, as illustrated before.

3 Recall from Section 2.2 that for negation-free programs, the WFM is the least Herbrand model.

4 A sufficient condition for this is that the rules in the ProbLog program are locally stratified
(Van Gelder et al. 1991). In particular, this trivially holds for all negation-free programs.

Inference and Learning in PLP using Weighted Formulas

Table 1. Total choices and their probabilities

      Total choice C                                                    P(C)
  1   {burglary, earthquake, hears_alarm(john), hears_alarm(mary)}      0.0098
  2   {burglary, earthquake, hears_alarm(john)}                         0.0042
  3   {burglary, earthquake, hears_alarm(mary)}                         0.0042
  4   {burglary, earthquake}                                            0.0018
  5   {burglary, hears_alarm(john), hears_alarm(mary)}                  0.0392
  6   {burglary, hears_alarm(john)}                                     0.0168
  7   {burglary, hears_alarm(mary)}                                     0.0168
  8   {burglary}                                                        0.0072
  9   {earthquake, hears_alarm(john), hears_alarm(mary)}                0.0882
 10   {earthquake, hears_alarm(john)}                                   0.0378
 11   {earthquake, hears_alarm(mary)}                                   0.0378
 12   {earthquake}                                                      0.0162
 13   {hears_alarm(john), hears_alarm(mary)}                            0.3528
 14   {hears_alarm(john)}                                               0.1512
 15   {hears_alarm(mary)}                                               0.1512
 16   {}                                                                0.0648

Example 3 (Models and their probabilities)

(Continuing Example 2.) The total choice {burglary, earthquake, hears_alarm(john)},
which has probability 0.1 × 0.2 × 0.7 × (1 − 0.7) = 0.0042, yields the following logic
program.

burglary.                 person(mary).
earthquake.               person(john).
hears_alarm(john).
alarm :- earthquake.
alarm :- burglary.
calls(X) :- alarm, hears_alarm(X).

The WFM of this program is the world {person(mary), person(john), burglary,
earthquake, hears_alarm(john), ¬hears_alarm(mary), alarm, calls(john), ¬calls(mary)}.
Hence this world is a model and its probability is 0.0042. In total there are 16 models,
corresponding to each of the 16 total choices shown in Table 1. Note that, out of
all possible interpretations of the vocabulary, there are many that are not models of
the ProbLog program. An example is any world of the form {burglary, ¬alarm, ...}:
it is impossible that alarm is false while burglary is true. The probability of such
worlds is zero.
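
Since the Alarm program is negation-free, the model of each total choice is its least Herbrand model, which simple forward chaining can compute. The Python sketch below is only an illustration under that assumption: the rule calls(X) :- alarm, hears_alarm(X) is grounded by hand for john and mary, and the names `RULES`, `FACTS` and `least_model` are hypothetical.

```python
from itertools import combinations

# Hand-grounded rules of the Alarm program as (head, [positive body atoms]).
RULES = [
    ("alarm", ["burglary"]),
    ("alarm", ["earthquake"]),
    ("calls(john)", ["alarm", "hears_alarm(john)"]),
    ("calls(mary)", ["alarm", "hears_alarm(mary)"]),
]
FACTS = {"person(mary)", "person(john)"}
PROB_ATOMS = ["burglary", "earthquake", "hears_alarm(john)", "hears_alarm(mary)"]

def least_model(total_choice):
    """Forward chaining from the chosen facts; returns the set of true atoms."""
    true_atoms = set(total_choice) | FACTS
    changed = True
    while changed:
        changed = False
        for head, body in RULES:
            if head not in true_atoms and all(b in true_atoms for b in body):
                true_atoms.add(head)
                changed = True
    return true_atoms

world = least_model({"burglary", "earthquake", "hears_alarm(john)"})
print(sorted(world))

# Every total choice yields exactly one model, so there are 16 models in total.
all_choices = [set(c) for r in range(len(PROB_ATOMS) + 1)
               for c in combinations(PROB_ATOMS, r)]
models = {frozenset(least_model(c)) for c in all_choices}
print(len(models))
```

Atoms absent from the returned set are false in the world, so hears_alarm(mary) and calls(mary) are false for the total choice of this example.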



3.3 Related Languages 

ProbLog is strongly related to several other languages, in particular to Probabilistic
Logic Programming (PLP) languages like PRISM (Sato and Kameya 2008), ICL
(Poole 2008) and LPAD (Vennekens et al. 2009), and other languages like Markov
Logic (Poon and Domingos 2006). Table 2 shows the main features of each language
and the major corresponding system.



Table 2. Overview of features of several probabilistic logical languages and the
corresponding systems (implementations). The first three features are properties
of the language, the last two are properties of the system. We refer to the first
ProbLog system as ProbLog1 and to the system described here as ProbLog2.

Language                      ProbLog    ProbLog    PRISM   ICL      LPAD   MLN
System                        ProbLog1   ProbLog2   PRISM   AILog2   PITA   Alchemy

Cyclic rules                  ✓          ✓          −       −        ✓      ✓
Overlapping rule bodies       ✓          ✓          −       ✓        ✓      n/a
Inductive definitions         ✓          ✓          ✓       ✓        ✓      −
Evidence on arbitrary atoms   −          ✓          −       ✓        −      ✓
Multiple queries              −          ✓          −       −        −      ✓



Compared to most other PLP languages, ProbLog is more expressive with respect
to the rules that are allowed in a program. This holds in particular for PRISM
and ICL. Both PRISM and ICL require the rules to be acyclic (or contingently
acyclic) (Sato and Kameya 2008; Poole 2008). In ProbLog we can have cyclic programs
with rules such as smokes(X) :- smokes(Y), influences(Y,X). Cyclic rules
of this type are often needed for tasks such as collective classification or social
network analysis (see Section 9). In addition to acyclicity, PRISM also requires
rules with unifiable heads to have mutually exclusive bodies (such that at most
one of these bodies can be true simultaneously; this is the mutual exclusiveness
assumption). ProbLog does not have this restriction, so rules with unifiable heads
can have 'overlapping' bodies. For instance, the bodies of the two alarm rules in
our running example are overlapping: either burglary or earthquake is sufficient for
making the alarm go off, but both can also happen at the same time.



LPADs, as used in the PITA system (Riguzzi and Swift 2013), do not have
these syntactic restrictions, and are hence on par with ProbLog in this respect.
However, the PITA system does not support the same tasks as the new ProbLog2
system does. For instance, when computing marginal probabilities, ProbLog2 can
deal with multiple queries simultaneously and can incorporate evidence, while PITA
uses the more traditional PLP setting which considers one query at a time, without
evidence (the success probability setting, see Section 4). The same also holds for the
first ProbLog system (Kimmig et al. 2010). Note that while evidence can in some
special cases be incorporated through modelling,⁵ we here focus on the general case,
i.e., the ability of the system to handle evidence on any arbitrary subset of all atoms
in the Herbrand base.

ProbLog2 is the first PLP system that possesses all the features considered in
Table 2, i.e., that supports multiple queries and evidence while having none of the
language restrictions. The experiments in this paper (Section 9) require all these
features and can hence only be carried out in ProbLog2, but not in the other PLP
systems.



Markov Logic (Poon and Domingos 2006) is strictly speaking not a PLP language
as it is based on First-Order Logic instead of Logic Programming. Nevertheless,
Markov Logic of course serves the same purpose as the above PLP languages. In
terms of expressivity, Markov Logic has the drawback that it cannot express (non-ground)
inductive definitions. An example of an inductive definition is the definition
of the notion of a path in a graph in terms of the edges. This can be written in
plain Prolog and hence also in ProbLog.

path(X,Y) :- edge(X,Y).
path(X,Y) :- edge(X,Z), path(Z,Y).

In the knowledge representation community, it is well-known that inductive definitions
can naturally be represented in Logic Programming (LP), due to LP's least
or well-founded model semantics (Denecker et al. 2001). In contrast, in First-Order
Logic (FOL) one cannot express non-ground inductive definitions, such as the path
definition above (Grädel 1992). The reason is, roughly speaking, that path is the
transitive closure of edge, and FOL can express that a given relation is transitive,
but cannot in general specify this closure. This result carries over to the probabilistic
case: we can express inductive definitions in PLP languages like ProbLog but
not in FOL-based languages like Markov Logic.⁶ While the non-probabilistic case
has been well-studied in the knowledge representation literature (Denecker et al.
2001; Grädel 1992), the probabilistic case has only very recently received attention
(Fierens et al. 2012).
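
The difference between defining a closure and merely constraining it can be seen on a small graph. In the Python sketch below, the edge set and the helper names `least_path` and `satisfies_rules` are assumptions made for illustration only. The least fixpoint of the two path rules is the transitive closure, whereas the same rules read as FOL implications are also satisfied by strictly larger relations.

```python
def least_path(edges):
    """Least fixpoint of the two path rules: the transitive closure of edge."""
    path = set(edges)
    changed = True
    while changed:
        changed = False
        for (x, z) in edges:
            for (z2, y) in list(path):
                if z == z2 and (x, y) not in path:
                    path.add((x, y))
                    changed = True
    return path

def satisfies_rules(edges, path):
    """Check both rules read as FOL implications (body implies head)."""
    base = all(e in path for e in edges)
    step = all((x, y) in path
               for (x, z) in edges for (z2, y) in path if z == z2)
    return base and step

# A hypothetical three-node chain a -> b -> c.
edges = {("a", "b"), ("b", "c")}
nodes = {"a", "b", "c"}

closure = least_path(edges)
everything = {(x, y) for x in nodes for y in nodes}

print(sorted(closure))
print(satisfies_rules(edges, closure), satisfies_rules(edges, everything))
```

Both relations satisfy the implications, but only the smaller one is the intended meaning of path; the LP semantics selects it, while the implications alone do not.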



4 Inference Tasks 

In the literature on probabilistic graphical models and statistical relational learning,
the two most common inference tasks are computing the marginal probability
of a set of random variables given some observations or evidence (we call this the
MARG task), and finding the most likely joint state of the random variables given
the evidence (known as the MPE task, for Most Probable Explanation). In PLP, the
focus has been on the special case of MARG where there is only a single query atom
Q and no evidence. This task is often called computing the success probability of Q
(De Raedt et al. 2007). The only works related to the general MARG or MPE task
in the PLP literature make a number of restrictive assumptions about the given
program such as acyclicity (Gutmann et al. 2011) and the mutual exclusiveness
assumption of PRISM (Sato and Kameya 2008). There also exist approaches that
transform ground probabilistic programs to Bayesian networks and then use standard
Bayesian network inference procedures (Meert et al. 2009). However, these are
also restricted to acyclic and already grounded programs.

5 For instance, when encoding a Bayesian network in PLP, evidence on nodes at the top of the
network (nodes without parents) can be incorporated by including deterministic facts in the
program.

6 This discussion applies to non-ground ProbLog programs and Markov Logic Networks (MLNs).
In Section 5.3 we show that every ground ProbLog program can be converted to an equivalent
ground MLN. The above implies that no such conversion exists on the non-ground (first-order)
level.

Our approach for the MARG and MPE inference tasks does not suffer from such
restrictions and is applicable to all ProbLog programs. We now formally define these
tasks, in addition to a third, strongly related task. Let At be the Herbrand base, i.e.,
the set of all ground (probabilistic and derived) atoms in a given ProbLog program.
We assume that we are given a set E ⊆ At of observed atoms and a vector e with
their observed truth values. We refer to this as the evidence and write E = e. Note
that the evidence is essentially a partial interpretation of the atoms in the ProbLog
program.

• In the MARG task, we are given a set Q ⊆ At of atoms of interest, called
query atoms. The task is to compute the marginal probability distribution
of every such atom given the evidence, i.e. compute
$P(Q \mid \mathbf{E} = \mathbf{e})$ for each $Q \in \mathbf{Q}$.
• The EVID or 'probability of evidence' task is to compute P(E = e). It
corresponds to the likelihood of data in a learning setting and can be used as
a building block for solving the MARG task (see Section 6.2).
• The MPE task is to find the most likely interpretation (joint state) of all non-evidence
atoms given the evidence, i.e. finding
$\arg\max_{\mathbf{u}} P(\mathbf{U} = \mathbf{u} \mid \mathbf{E} = \mathbf{e})$,
with U being the unobserved atoms, i.e., U = At \ E.

As the following example illustrates, the different tasks are strongly related. 

Example 4 (Inference tasks) 

Consider the ProbLog program of Example 1 and assume that we know that John
calls, so E = {calls(john)} and e = {true}. It can be verified that calls(john)
is true in 6 of the 16 models of the program, namely the models of total choices
1, 2, 5, 6, 9 and 10 of Table 1. The sum of their probabilities is 0.196, so this is
the probability of evidence (EVID). The MPE task boils down to finding the world
with the highest probability out of the 6 worlds that have calls(john) = true. It
can be verified that this is the world corresponding to total choice 9, i.e., the choice
{earthquake, hears_alarm(john), hears_alarm(mary)}. An example of the MARG
task is to compute the probability that there is a burglary, i.e.,

$$P(\mathit{burglary} = \mathit{true} \mid \mathit{calls}(\mathit{john}) = \mathit{true}) = \frac{P(\mathit{burglary} = \mathit{true} \wedge \mathit{calls}(\mathit{john}) = \mathit{true})}{P(\mathit{calls}(\mathit{john}) = \mathit{true})}.$$

There are 4 models in which both calls(john) and burglary are true (models 1, 2, 5
and 6), and their sum of probabilities is 0.07. Hence,
P(burglary = true | calls(john) = true) = 0.07 / 0.196 = 0.357.

7 The common PLP task of computing the success probability of an atom Q is a special case of
MARG with Q being the singleton {Q} and E = ∅.
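
The three quantities in Example 4 can be verified by brute-force enumeration over the 16 total choices. The Python sketch below is a check under simplifying assumptions, not the inference method proposed in this paper: the derived atoms are computed by hard-coded Boolean expressions that mirror the Alarm rules, and the names `PROB` and `worlds` are hypothetical.

```python
from itertools import product

PROB = {"burglary": 0.1, "earthquake": 0.2,
        "hears_alarm(john)": 0.7, "hears_alarm(mary)": 0.7}

def worlds():
    """Yield (world, probability) for every total choice of the Alarm program."""
    names = list(PROB)
    for values in product([True, False], repeat=len(names)):
        w = dict(zip(names, values))
        p = 1.0
        for n in names:
            p *= PROB[n] if w[n] else 1.0 - PROB[n]
        # Derived atoms, hard-coded from the rules of the program.
        w["alarm"] = w["burglary"] or w["earthquake"]
        w["calls(john)"] = w["alarm"] and w["hears_alarm(john)"]
        w["calls(mary)"] = w["alarm"] and w["hears_alarm(mary)"]
        yield w, p

# Keep only the worlds that agree with the evidence calls(john) = true.
consistent = [(w, p) for w, p in worlds() if w["calls(john)"]]

evid = sum(p for _, p in consistent)                                # EVID
marg = sum(p for w, p in consistent if w["burglary"]) / evid        # MARG
mpe_world, mpe_p = max(consistent, key=lambda wp: wp[1])            # MPE

print(round(evid, 4), round(marg, 3), round(mpe_p, 4))
print(sorted(a for a, v in mpe_world.items() if v))
```

Enumeration is exponential in the number of probabilistic facts, which is why it serves only as a sanity check on small programs.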

Our approach to inference consists of two steps: 1) convert the program to a
weighted Boolean formula and 2) perform inference on the resulting weighted formula.
We discuss these two steps in the next sections.



5 Conversion to a Weighted Formula 

Our conversion takes as input a ProbLog program L, evidence E = e and a set
of query atoms Q, and returns a weighted Boolean (propositional) formula that
contains all necessary information. The conversion is similar for each of the considered
tasks (MARG, MPE or EVID). The only difference is the choice of the query
set Q. For MARG, Q is the set of atoms for which we want to compute marginal
probabilities. For EVID and MPE, we can take Q = ∅ (see Section 6.1.1).
The outline of the conversion algorithm is as follows.

1. Ground L yielding a program $L_g$ while taking into account Q and E = e (cf.
   Theorem 1, Section 5.1).
   It is unnecessary to consider the full grounding of the program; we only need
   the part that is relevant to the query given the evidence, that is, the part that
   captures the distribution P(Q | E = e). We refer to the resulting program $L_g$
   as the relevant ground program with respect to Q and E = e.
2. Convert the ground rules in $L_g$ to an equivalent Boolean formula $\varphi_r$ (cf.
   Lemma 1, Section 5.2).
   This step converts the logic programming rules to an equivalent formula.
3. Assert the evidence and define a weight function (cf. Theorem 2, Section 5.3).
   To obtain the weighted formula, we first assert the evidence by defining the
   formula $\varphi$ as the conjunction of the formula $\varphi_r$ for the rules (step 2) and
   the formula $\varphi_e$ for the evidence. Then we define a weight function for all
   atoms in $\varphi$.

The correctness of the algorithm is shown below; this relies on the indicated theorems
and lemmas. Before describing the algorithm in detail, we illustrate it on our
Alarm example.

Example 5 (The three steps in the conversion)

As in Example 4, we take calls(john) = true as evidence. Suppose that we want
to compute the marginal probability of burglary, so the query set Q is {burglary}.
The relevant ground program is as follows.

% ground probabilistic facts
0.1::burglary.   0.2::earthquake.   0.7::hears_alarm(john).
% ground rules
alarm :- burglary.
alarm :- earthquake.
calls(john) :- alarm, hears_alarm(john).

Note that mary does not appear in the grounding because, if we have no evidence
about her hearing the alarm or calling, she does not affect the probability
P(burglary | calls(john) = true).

Step 2 converts the three ground rules of the relevant ground program to an
equivalent propositional formula φ_r (see Section 5.2). This formula is the
conjunction of alarm ↔ burglary ∨ earthquake and
calls(john) ↔ alarm ∧ hears_alarm(john).[8]
Step 3 adds the evidence. Since we have only one evidence atom in our example
(namely, calls(john) is true), all we need to do is to add the positive unit
clause calls(john) to the formula φ_r. The resulting formula φ is
φ_r ∧ calls(john). Step 3 also defines the weight function, which assigns a
weight to each literal in φ (see Section 5.3). This results in the weighted
formula, that is, the combination of the weight function and the Boolean
formula φ.

We now explain the three steps of the conversion in detail. 



5.1 The Relevant Ground Program 

In order to convert the ProbLog program to a Boolean formula we first ground it.
We try to find the part of the grounding that is relevant to the queries Q and
the evidence E = e. In SRL, this is also called knowledge-based model
construction (Kersting and De Raedt 2001). To do this, we make use of the
concept of a dependency set with respect to a ProbLog program. We first explain
our algorithm and then show its correctness.

The dependency set of a ground atom a is the set of all ground atoms that
occur in some proof of a. The dependency set of multiple atoms is the union of
their dependency sets. We call a ground atom relevant with respect to Q and E
if it occurs in the dependency set of Q ∪ E. We call a ground rule relevant if
it contains only relevant atoms. It is safe to restrict the grounding to the
relevant rules only. To find the relevant atoms and rules, we apply SLD
resolution to prove all atoms in Q ∪ E (this can be seen as backchaining over
the rules starting from Q ∪ E). We employ tabling to avoid proving the same
atom twice (and to avoid going into an infinite loop if the rules are cyclic).
The relevant rules are all ground rules encountered during the resolution
process. As our ProbLog programs are range-restricted, all the variables in
the rules used during the SLD resolution will eventually become ground, and
hence also the rules themselves.

The above grounding algorithm is not optimal as it does not make use of all
available information. For instance, it does not make use of exactly what the
evidence is (the values e), but only of which atoms are in the evidence (the
set E). One simple, yet sometimes very effective, optimization is to prune
inactive rules. We call a ground rule inactive if the body of the rule contains
a literal l that is false in the evidence (l can be an atom that is false in e,
or the negation of an atom that is true in e). Inactive rules do not contribute
to the semantics of a program. Hence they can be omitted. In practice, we do
this simultaneously with the above process: we omit inactive rules encountered
during the SLD resolution.[9]

[8] For subsequent steps, it is often convenient to write this formula in
conjunctive normal form (CNF). For example, some knowledge compilation systems
require CNF input.

The result of this grounding algorithm is what we call the relevant ground
program L_g for L with respect to Q and E = e. It contains all the information
necessary for solving the corresponding EVID, MARG or MPE task. The advantage
of this 'focussed' approach (i.e., taking into account Q and E = e during
grounding) is that the program, and hence the weighted formula, becomes more
compact, which makes subsequent inference more efficient. The disadvantage is
that we need to redo the conversion to a weighted formula when the evidence and
queries change. This is no problem since the conversion is fast compared to the
actual inference (see Section 9).

The following theorem shows the correctness of our approach. 

Theorem 1

Let L be a ProbLog program and let L_g be the relevant ground program for L with
respect to Q and E = e. Then L and L_g specify the same distribution
P(Q | E = e).

The proofs of all theorems in this paper are given in the appendix. 

We already showed the relevant ground program for the Alarm example in
Example 5 (in that case, there were irrelevant rules about mary, but no
inactive rules because there was no negative evidence). To illustrate our
approach for cyclic programs, we use the well-known Smokers example (Domingos
et al. 2008).



Example 6 (ProbLog program for Smokers)

The ProbLog program for the Smokers example models two causes for people to 
smoke: either they spontaneously start because of stress or they are influenced by 
one of their friends. 

0.2::stress(P) :- person(P).
0.3::influences(P1,P2) :- friend(P1,P2).
person(p1).  person(p2).  person(p3).
friend(p1,p2).  friend(p1,p3).
friend(p2,p1).  friend(p3,p1).
smokes(X) :- stress(X).
smokes(X) :- smokes(Y), influences(Y,X).

With the evidence {smokes(p2) = true, smokes(p3) = false} and the query set
{smokes(p1)}, we obtain the following ground program:

0.2::stress(p1).  0.2::stress(p2).  0.2::stress(p3).
0.3::influences(p2,p1).  0.3::influences(p1,p2).  0.3::influences(p1,p3).
% irrelevant probabilistic fact!!  0.3::influences(p3,p1).

smokes(p1) :- stress(p1).
smokes(p1) :- smokes(p2), influences(p2,p1).
% inactive rule!!  smokes(p1) :- smokes(p3), influences(p3,p1).
smokes(p2) :- stress(p2).
smokes(p2) :- smokes(p1), influences(p1,p2).
smokes(p3) :- stress(p3).
smokes(p3) :- smokes(p1), influences(p1,p3).

[9] This deals with literals that are false in the evidence. Conversely, when a
body of a ground rule contains a literal that is true in the evidence, it has to
be kept and the rule cannot be simplified. The reason is that the atom's
presence might give rise to a positive loop, which has to be detected during the
conversion of the ground program to a Boolean formula in the next step.



The evidence smokes(p3) = false makes the third rule for smokes(p1) inactive.
This in turn makes the probabilistic fact for influences(p3,p1) irrelevant.
Nevertheless, the rules for smokes(p3) have to be in the grounding, as the
truth value of the head of a rule constrains the truth values of the bodies.
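The grounding step of Example 6 can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions that are not part of the paper: the rules are already ground, rule bodies contain positive atoms only, and the function name `relevant_ground_program` is hypothetical. A visited set plays the role that tabling plays in the SLD-based procedure described above.

```python
from typing import Dict, List, Set, Tuple

Rule = Tuple[str, List[str]]  # (head, positive body atoms); illustrative encoding

# Full grounding of the Smokers rules for p1, p2, p3 (Example 6).
RULES: List[Rule] = [
    ("smokes(p1)", ["stress(p1)"]),
    ("smokes(p2)", ["stress(p2)"]),
    ("smokes(p3)", ["stress(p3)"]),
    ("smokes(p1)", ["smokes(p2)", "influences(p2,p1)"]),
    ("smokes(p1)", ["smokes(p3)", "influences(p3,p1)"]),
    ("smokes(p2)", ["smokes(p1)", "influences(p1,p2)"]),
    ("smokes(p3)", ["smokes(p1)", "influences(p1,p3)"]),
]

def relevant_ground_program(rules: List[Rule],
                            targets: List[str],
                            evidence: Dict[str, bool]):
    """Backchain from the atoms in Q and E, skipping inactive rules."""
    relevant_atoms: Set[str] = set()
    relevant_rules: List[Rule] = []
    todo = list(targets)
    while todo:
        atom = todo.pop()
        if atom in relevant_atoms:        # already handled (cf. tabling)
            continue
        relevant_atoms.add(atom)
        for head, body in rules:
            if head != atom:
                continue
            # A rule is inactive if a body atom is false in the evidence.
            if any(evidence.get(b) is False for b in body):
                continue
            relevant_rules.append((head, body))
            todo.extend(body)
    return relevant_atoms, relevant_rules

evidence = {"smokes(p2)": True, "smokes(p3)": False}
query = ["smokes(p1)"]
atoms, kept_rules = relevant_ground_program(RULES, query + list(evidence), evidence)
print(sorted(atoms))
print(len(kept_rules), "rules kept")
```

Under these assumptions the sketch keeps six of the seven ground rules and never reaches influences(p3,p1), in line with the example.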



5.2 The Boolean Formula for the Ground Program 

We now discuss how to convert the rules in the relevant ground program L_g to an
equivalent Boolean formula φ_r. Converting a set of logic programming (LP) rules
to an equivalent Boolean formula is a purely logical (non-probabilistic) problem.
This has been well studied in the LP literature, where several conversions have
been proposed, e.g. Janhunen (2004). Note that the conversion is not merely a
syntactical rewriting issue; the point is that the rules and the formula are to
be interpreted according to a different semantics. Hence the conversion should
compensate for this: the rules under LP semantics (with Closed World Assumption)
should be equivalent to the formula under FOL semantics (without CWA).

For acyclic rules, the conversion is straightforward: we can simply take Clark's
completion of the rules (Lloyd 1987; Janhunen 2004). We illustrate this on the
Alarm example, which is indeed acyclic.

Example 7 (Formula for the alarm rules) 

As shown in Example 5, the grounding of the Alarm example contains two rules for
alarm, namely alarm :- burglary and alarm :- earthquake. Clark's completion
of these rules is the propositional formula alarm ↔ burglary ∨ earthquake, i.e.,
the alarm goes off if and only if there is burglary or earthquake. Once we have
the formula, we often need to rewrite it in CNF, which is straightforward for a
completion formula. For the completion of alarm, the resulting CNF has three
clauses: alarm ∨ ¬burglary, alarm ∨ ¬earthquake, and
¬alarm ∨ burglary ∨ earthquake. The last clause reflects the CWA.[10]
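The CNF of a completion formula can be produced mechanically. The sketch below is an illustration only, with a hypothetical function name `completion_cnf` and an ad hoc literal encoding (a leading "-" marks negation); it assumes a single head atom whose rule bodies are conjunctions of literals. The "only if" direction is obtained by distributing the disjunction of bodies, which can blow up when there are many long bodies.

```python
from itertools import product
from typing import List

def negate(literal: str) -> str:
    """Negate a literal encoded as 'a' or '-a'."""
    return literal[1:] if literal.startswith("-") else "-" + literal

def completion_cnf(head: str, bodies: List[List[str]]) -> List[List[str]]:
    """CNF clauses of head <-> (body_1 v ... v body_n), each body a conjunction."""
    clauses = []
    # body_i -> head, one clause per rule
    for body in bodies:
        clauses.append([head] + [negate(l) for l in body])
    # head -> body_1 v ... v body_n, distributed into clauses
    for pick in product(*bodies):
        clauses.append([negate(head)] + list(pick))
    return clauses

alarm_cnf = completion_cnf("alarm", [["burglary"], ["earthquake"]])
calls_cnf = completion_cnf("calls(john)", [["alarm", "hears_alarm(john)"]])
for clause in alarm_cnf + calls_cnf:
    print(" v ".join(clause))
```

For the two alarm rules this yields exactly the three clauses listed in Example 7.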

For cyclic rules, the conversion is more complicated. This holds in particular
for rules with positive loops, i.e., loops with atoms that depend positively on
each other, as in the recursive rule for smokes/1. It is well-known that in the
presence of positive loops, Clark's completion is not correct, i.e. the
resulting formula is not equivalent to the rules (Janhunen 2004).

[10] The Alarm example models a Bayesian network for the MARG task. For Bayesian
networks, the problem of conversion to a weighted CNF formula has been
considered before, and several encodings exist (Darwiche 2009; Sang et al.
2005). For ProbLog programs modelling Boolean Bayesian networks, like Alarm, our
CNF encoding coincides with that of Sang et al. (2005).



Example 8 (Simplified Smokers example)

Let us focus on the Smokers program of Example 6, but restricted to persons p1
and p2.

0.2::stress(p1).  0.3::influences(p2,p1).
0.2::stress(p2).  0.3::influences(p1,p2).

smokes(p1) :- stress(p1).
smokes(p1) :- smokes(p2), influences(p2,p1).
smokes(p2) :- stress(p2).
smokes(p2) :- smokes(p1), influences(p1,p2).

Clark's completion of the rules for smokes(p1) and smokes(p2) would result in a
formula which has as a model {smokes(p1), smokes(p2), ¬stress(p1), ¬stress(p2),
influences(p1,p2), influences(p2,p1), ...}, but this is not a model of the
ground ProbLog program: the only model resulting from the total choice
{¬stress(p1), ¬stress(p2), influences(p1,p2), influences(p2,p1), ...} is the
model in which smokes(p1) and smokes(p2) are both false.
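The mismatch in Example 8 can be checked by brute force. The following sketch is illustrative and rests on assumptions of ours: the rules are definite (no negation), the model of the program for a total choice is computed as a least fixpoint, and the helper names `least_model` and `satisfies_completion` are hypothetical.

```python
from typing import List, Set, Tuple

Rule = Tuple[str, List[str]]

RULES: List[Rule] = [
    ("smokes(p1)", ["stress(p1)"]),
    ("smokes(p1)", ["smokes(p2)", "influences(p2,p1)"]),
    ("smokes(p2)", ["stress(p2)"]),
    ("smokes(p2)", ["smokes(p1)", "influences(p1,p2)"]),
]

def least_model(facts: Set[str], rules: List[Rule]) -> Set[str]:
    """Model of a definite ground program for one total choice (fixpoint)."""
    model = set(facts)
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            if head not in model and all(b in model for b in body):
                model.add(head)
                changed = True
    return model

def satisfies_completion(interp: Set[str], rules: List[Rule]) -> bool:
    """Check head <-> (disjunction of its rule bodies) for every head."""
    for head in {h for h, _ in rules}:
        some_body_true = any(all(b in interp for b in body)
                             for h, body in rules if h == head)
        if (head in interp) != some_body_true:
            return False
    return True

# Total choice: both influences facts true, both stress facts false.
choice = {"influences(p1,p2)", "influences(p2,p1)"}
spurious = choice | {"smokes(p1)", "smokes(p2)"}

print(satisfies_completion(spurious, RULES))   # the completion accepts it
print(least_model(choice, RULES) == choice)    # the program derives no smokes atom
```

The interpretation in which both persons smoke satisfies the completion because each smokes atom "supports" the other, whereas the program itself derives neither atom for this total choice.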

Since Clark's completion is inapplicable with positive loops, a range of more 
sophisticated conversion algorithms have been developed in the LP literature. Since 
the problem is of a highly technical nature, we are unable to repeat the full details 
in this paper. Instead, we briefly discuss the two conversion methods that we use 
in our work and refer to the corresponding literature for more details. 

Both conversion algorithms take a set of rules and construct an equivalent for- 
mula. The formulas generated by the two algorithms are typically syntactically 
different because the algorithms introduce a set of auxiliary atoms in the formula 
and these sets might differ. For both algorithms, the size of the formula typically 
increases with the number of positive loops in the rules. The two algorithms are 
the following. 



- The first algorithm is from the Answer Set Programming literature (Janhunen
  2004). It first rewrites the given rules into an equivalent set of rules
  without positive loops (all resulting loops involve negation). This requires
  the introduction of auxiliary atoms and rules. Since the resulting rules are
  free of positive loops, they can be converted by taking Clark's completion.
  The result can then be written as a CNF. This algorithm is rule-based, as
  opposed to the next algorithm.

- The second algorithm was introduced in the LP literature (Mantadelis and
  Janssens 2010) and is proof-based. It first constructs all proofs of all atoms
  of interest, in our case all atoms in Q ∪ E, using tabled SLD resolution.
  The proofs are collected in a recursive structure, namely a set of nested
  tries (Mantadelis and Janssens 2010), which will have loops if the given rules
  had loops. The algorithm then operates on this structure in order to 'break'
  the loops and obtain an equivalent Boolean formula. This formula can then be
  written as a CNF.



16 D. Fierens et al. 

Both the rule-based and the proof-based conversion algorithm return a formula
that is 'equivalent' to the rules in L_g, in the sense of the following lemma.

Lemma 1

Let L_g be a ground ProbLog program. Let φ_r denote the formula derived from the
rules in L_g. Then SAT(φ_r) = MOD(L_g).

Recall that MOD(L_g) denotes the set of models of a ProbLog program L_g, as
defined in Section 3.2. On the formula side, we use SAT(φ_r) to denote the set
of models of a formula φ_r.[11]

Example 9 (Boolean formula for the simplified Smokers example) 
Consider the ground program for the simplified Smokers example, given in
Example 8. The proof-based conversion algorithm converts the ground rules in
this program to an equivalent formula (in the sense of Lemma 1) consisting of
the conjunction of the following four subformulas.

smokes(p1) ↔ aux1 ∨ stress(p1)
smokes(p2) ↔ aux2 ∨ stress(p2)
aux1 ↔ smokes(p2) ∧ influences(p2,p1)
aux2 ↔ stress(p1) ∧ influences(p1,p2)

Here aux1 and aux2 are auxiliary atoms that are introduced by the conversion
(though they could be avoided in this case). Intuitively, aux1 says that person
p1 started smoking because he is influenced by person p2, who smokes himself.
Note that while the ground program (in Example 8) is cyclic, the loop has been
broken by the conversion process; this surfaces in the fact that the last
subformula uses stress(p1) instead of smokes(p1).
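For this small example, the equivalence claimed by Lemma 1 can be verified exhaustively. The sketch below is an illustration under our own assumptions: it hard-codes the four subformulas of Example 9 (with stress(p2) in the second one), computes the program's model for each of the 16 total choices as a least fixpoint, and compares models on the original atoms only, ignoring aux1 and aux2. All function names are hypothetical.

```python
from itertools import product

RULES = [
    ("smokes(p1)", ["stress(p1)"]),
    ("smokes(p1)", ["smokes(p2)", "influences(p2,p1)"]),
    ("smokes(p2)", ["stress(p2)"]),
    ("smokes(p2)", ["smokes(p1)", "influences(p1,p2)"]),
]
CHOICE_ATOMS = ["stress(p1)", "stress(p2)",
                "influences(p1,p2)", "influences(p2,p1)"]

def least_model(facts, rules):
    """Model of the definite ground program for one total choice."""
    model = set(facts)
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            if head not in model and all(b in model for b in body):
                model.add(head)
                changed = True
    return model

def formula_models(choice):
    """Models of the four subformulas, projected onto the smokes atoms."""
    st1, st2 = "stress(p1)" in choice, "stress(p2)" in choice
    i12, i21 = "influences(p1,p2)" in choice, "influences(p2,p1)" in choice
    found = []
    for s1, s2, aux1, aux2 in product([False, True], repeat=4):
        if (s1 == (aux1 or st1) and s2 == (aux2 or st2)
                and aux1 == (s2 and i21) and aux2 == (st1 and i12)):
            found.append({name for name, val in
                          (("smokes(p1)", s1), ("smokes(p2)", s2)) if val})
    return found

all_agree = True
for bits in product([False, True], repeat=4):
    choice = {a for a, b in zip(CHOICE_ATOMS, bits) if b}
    derived = least_model(choice, RULES) - choice   # true smokes atoms
    if formula_models(choice) != [derived]:
        all_agree = False
print(all_agree)
```

Each total choice yields exactly one model of the formula, and it agrees with the program's model on smokes(p1) and smokes(p2).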

5.3 The Weighted Boolean formula 

The final step of the conversion constructs the weighted Boolean formula
starting from the Boolean formula for the rules φ_r. First, the formula φ is
defined as the conjunction of φ_r and a formula φ_e capturing the evidence
E = e. Here φ_e is a conjunction of unit clauses: there is a unit clause a for
each true atom and a clause ¬a for each false atom in the evidence. Second, we
define the weight function for all literals in the resulting formula. The weight
of a probabilistic literal is derived from the probabilistic facts in the
program: if the relevant ground program contains a probabilistic fact p::f,
then we assign weight p to f and weight 1 − p to ¬f. The weight of a derived
literal (a literal not occurring in a probabilistic fact) is always 1. The
weight of a world ω, denoted w(ω), is defined to be the product of the weights
of all literals in ω.

[11] Both conversions for cyclic rules introduce additional or 'auxiliary' atoms
into φ_r. We can safely omit these atoms from the models in SAT(φ_r) because
both conversions are 'faithful', so the truth value of auxiliary atoms is
uniquely defined by the truth value of the original atoms. This means that the
introduction of the auxiliary atoms does not create extra models. Hence, w.r.t.
the original atoms we have the stated equivalence: SAT(φ_r) = MOD(L_g). W.r.t.
all atoms, φ_r and L_g are equisatisfiable.

Example 10 (Weighted formula for Alarm)

We have seen the formula for the Alarm program in Example 7. If we have evidence
that calls(john) is true, we add a positive unit clause calls(john) to this
formula (after doing this, we can potentially apply unit propagation to simplify
the formula). Then we define the weight function. The formula contains three
probabilistic atoms burglary, earthquake and hears_alarm(john). The other atoms
in the formula, alarm and calls(john), are derived atoms. Hence the weight
function is as follows.

burglary ↦ 0.1              ¬burglary ↦ 0.9
earthquake ↦ 0.2            ¬earthquake ↦ 0.8
hears_alarm(john) ↦ 0.7     ¬hears_alarm(john) ↦ 0.3
alarm ↦ 1                   ¬alarm ↦ 1
calls(john) ↦ 1             ¬calls(john) ↦ 1

We have now seen how to construct the entire weighted formula from the rele- 
vant ground program. The following theorem states that this weighted formula is 
equivalent - in a particular sense - to the relevant ground program. We will make 
use of this result when performing inference on the weighted formula. 

Theorem 2 

Let L_g be the relevant ground program for some ProbLog program with respect to
Q and E = e. Let MOD_{E=e}(L_g) be those models in MOD(L_g) that are consistent
with the evidence E = e. Let φ denote the formula and w(·) the weight function
of the weighted formula derived from L_g. Then:

- (model equivalence) SAT(φ) = MOD_{E=e}(L_g),

- (weight equivalence) ∀ω ∈ SAT(φ): w(ω) = P_{L_g}(ω), i.e., the weight of ω
  according to w(·) is equal to the probability of ω according to L_g.

Note the relationship with Lemma 1 (p. 16): Lemma 1 applies to the formula φ_r
prior to asserting the evidence, whereas Theorem 2 applies to the formula φ
after asserting evidence.

Example 11 (Equivalence of weighted formula and ground program)
The ground Alarm program of Example 5 has three probabilistic facts and hence
2^3 = 8 total choices and corresponding possible worlds. Three of these possible
worlds are consistent with the evidence calls(john) = true, namely the worlds
resulting from choices in which hears_alarm(john) is always true and at least
one of {burglary, earthquake} is true. The reader can verify that the Boolean
formula constructed in Example 10 has exactly the same three models, and that
weight equivalence holds for each of these models.
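The verification left to the reader can be spelled out by enumeration. The following sketch is illustrative only (the helper names are ours): it builds the possible world for each of the 8 total choices of the ground Alarm program, keeps those consistent with calls(john) = true, and compares them with the models of φ found by enumerating all 32 truth assignments.

```python
from itertools import product

PROBS = {"burglary": 0.1, "earthquake": 0.2, "hears_alarm(john)": 0.7}
ATOMS = list(PROBS) + ["alarm", "calls(john)"]

def world(choice):
    """Possible world of the ground Alarm program for one total choice."""
    alarm = choice["burglary"] or choice["earthquake"]
    calls = alarm and choice["hears_alarm(john)"]
    return {**choice, "alarm": alarm, "calls(john)": calls}

def weight(m):
    """Weight function of Example 10: derived literals have weight 1."""
    w = 1.0
    for atom, p in PROBS.items():
        w *= p if m[atom] else 1.0 - p
    return w

def satisfies_phi(m):
    """phi = completion of the rules, conjoined with the evidence."""
    return (m["alarm"] == (m["burglary"] or m["earthquake"])
            and m["calls(john)"] == (m["alarm"] and m["hears_alarm(john)"])
            and m["calls(john)"])

# Program side: worlds consistent with the evidence calls(john) = true.
program_worlds = []
for bits in product([True, False], repeat=3):
    w = world(dict(zip(PROBS, bits)))
    if w["calls(john)"]:
        program_worlds.append(w)

# Formula side: models of phi over all five atoms.
formula_models = [dict(zip(ATOMS, bits))
                  for bits in product([True, False], repeat=len(ATOMS))
                  if satisfies_phi(dict(zip(ATOMS, bits)))]

as_set = lambda ms: {frozenset(m.items()) for m in ms}
same_models = as_set(program_worlds) == as_set(formula_models)
total_weight = sum(weight(m) for m in formula_models)
print(len(formula_models), same_models, round(total_weight, 3))
```

The three weights are 0.014, 0.056 and 0.126, and their sum, 0.196, is the probability of the evidence.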

There is also a link between the weighted formula and Markov Logic Networks
(MLNs). Readers unfamiliar with MLNs can consult Appendix B. The weighted
formula that we construct can be regarded as a ground MLN. The MLN contains
the Boolean formula as a 'hard' formula (with infinite weight). The MLN also has
two weighted unit clauses per probabilistic atom: for a probabilistic atom a and
weight function {a ↦ p, ¬a ↦ 1 − p}, the MLN contains a unit clause a with
weight ln(p) and a unit clause ¬a with weight ln(1 − p).[12]

Example 12 (MLN for the Alarm example)

The Boolean formula φ for our 'Alarm' running example was shown in Example 5.
The corresponding MLN contains this formula as a hard formula. The MLN also
contains the following six weighted unit clauses.

ln(0.1)  burglary              ln(0.9)  ¬burglary
ln(0.2)  earthquake            ln(0.8)  ¬earthquake
ln(0.7)  hears_alarm(john)     ln(0.3)  ¬hears_alarm(john)

We have the following equivalence result. 

Theorem 3 

Let L_g be the relevant ground program for some ProbLog program with respect to
Q and E = e. Let M be the corresponding ground MLN. The distribution P(Q)
according to M is the same as the distribution P(Q | E = e) according to L_g.

Note that for the MLN we consider the distribution P(Q) (not conditioned on the 
evidence). This is because the evidence is already hard-coded in the MLN. 
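For the Alarm MLN of Example 12, this equivalence can be checked numerically. The sketch below is an illustration under the usual log-linear reading of a ground MLN, which we assume here: a world that violates the hard formula has probability 0, and every other world has probability proportional to the exponential of the summed weights of the unit clauses it satisfies. Variable and function names are ours.

```python
import math
from itertools import product

ATOMS = ["burglary", "earthquake", "hears_alarm(john)", "alarm", "calls(john)"]

# (weight of unit clause a, weight of unit clause not-a), as in Example 12.
LOG_WEIGHTS = {
    "burglary": (math.log(0.1), math.log(0.9)),
    "earthquake": (math.log(0.2), math.log(0.8)),
    "hears_alarm(john)": (math.log(0.7), math.log(0.3)),
}

def hard_formula(m):
    """The Boolean formula phi, including the evidence calls(john)."""
    return (m["alarm"] == (m["burglary"] or m["earthquake"])
            and m["calls(john)"] == (m["alarm"] and m["hears_alarm(john)"])
            and m["calls(john)"])

def unnormalized(m):
    """exp of the sum of weights of the satisfied unit clauses."""
    total = 0.0
    for atom, (w_pos, w_neg) in LOG_WEIGHTS.items():
        total += w_pos if m[atom] else w_neg
    return math.exp(total)

worlds = [dict(zip(ATOMS, bits))
          for bits in product([True, False], repeat=len(ATOMS))]
models = [m for m in worlds if hard_formula(m)]

z = sum(unnormalized(m) for m in models)                     # partition function
p_burglary = sum(unnormalized(m) for m in models if m["burglary"]) / z
print(round(z, 3), round(p_burglary, 4))
```

Here the partition function equals P(calls(john) = true) = 0.196, and the MLN marginal of burglary equals 0.07 / 0.196, which is P(burglary | calls(john) = true) in the ProbLog program.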



6 Inference on the Weighted Formula 

To solve the given inference task for the probabilistic logic program L, the query Q 
and evidence E = e, we have converted the program to a weighted Boolean formula. 
A key advantage is that the inference task (be it MARG, MPE or EVID) can now 
be reformulated in terms of well-known tasks such as weighted model counting or 
weighted MAX-SAT on the weighted formula. This implies that we can use any of 
the existing state-of-the-art algorithms for solving these tasks. In other
words, by converting the ProbLog program to a weighted formula, we get the
inference algorithms for free.



6.1 Task 1: Computing the probability of evidence (EVID) 

Computing the probability of evidence reduces to weighted model counting (WMC), 
a well-studied task in the SAT community. Model counting for a propositional 
formula is the task of computing the number of models of the formula. WMC is 
the generalization where every model has a weight and the task is to compute the 
sum of weights of all models. The fact that computing the probability of evidence 
P(E = e) reduces to WMC on our weighted formula can be seen as follows. 

$$P(E = e) \;=\; \sum_{\omega \in MOD_{E=e}(L)} P_L(\omega) \;=\; \sum_{\omega \in SAT(\varphi)} w(\omega)$$



[12] The values of the logarithms (and hence the weights) are negative, but any
MLN with negative weights can be rewritten into an equivalent MLN with only
positive weights (Domingos et al. 2008).

The first equality holds because P(E = e) by definition equals the total
probability of all worlds consistent with the evidence. The second equality
follows from Theorem 2: model equivalence implies that the sets over which the
sums range are equal, and weight equivalence implies that the summed terms are
equal. Computing the sum of w(ω) over all ω ∈ SAT(φ) is exactly what WMC on the
weighted formula φ does. It is well-known that inference with Bayesian networks
can be solved using WMC (Sang et al. 2005). In (Fierens et al. 2011) we were the
first to point out that this also holds for inference with probabilistic logic
programs. As we will see in the experiments, this approach improves upon
state-of-the-art methods in probabilistic logic programming.

The above leaves open how we solve the WMC problem. There exist many approaches
to WMC, both exact (Darwiche 2004) and approximate (Gomes et al. 2007). An
approach that is particularly useful in our context is that of knowledge
compilation, 'compiling' the weighted formula into a more 'efficient' form.
While knowledge compilation has been studied for many different tasks (Darwiche
and Marquis 2002), we need a form that allows for efficient WMC. Concretely, we
compile the weighted formula into a so-called arithmetic circuit (Darwiche
2009), which is closely linked to the concept of deterministic, decomposable
negation normal form (d-DNNF) (Darwiche 2004).



6.1.1 Compilation to an Arithmetic Circuit via d-DNNF 

We now introduce the necessary background on knowledge compilation and illus- 
trate the approach with an example. 

Knowledge compilation is concerned with compiling a logical formula, for which a
certain family of inference tasks is hard to compute, into a representation
where the same tasks are tractable (so the complexity of the problem is shifted
to the compilation phase). In this case, the hard task is to compute weighted
model counts (which is #P-complete in general). After compiling a logical
formula into a deterministic, decomposable negation normal form circuit (d-DNNF)
representation (Darwiche 2004) and converting the d-DNNF into an arithmetic
circuit, the weighted model count of the formula can efficiently be computed,
conditioned on any set of evidence. This allows us to compile a single d-DNNF
circuit and evaluate all marginals efficiently using this circuit.

A negation normal form formula (NNF) is a rooted directed acyclic graph in
which each leaf node is labeled with a literal and each internal node is labeled
with a conjunction or disjunction. A decomposable negation normal form (DNNF)
is an NNF satisfying decomposability: for every conjunction node, it should hold
that no two children of the node share any atom with each other. A deterministic
DNNF (d-DNNF) is a DNNF satisfying determinism: for every disjunction node, all
children should represent formulas that are logically inconsistent with each
other. For WMC, we need a d-DNNF that also satisfies smoothness: for every
disjunction node, all children should use exactly the same set of atoms.
Compiling a Boolean formula to a (smooth) d-DNNF is a well-studied problem, and
several compilers are available (Darwiche 2004; Muise et al. 2012). These
circuits are the most compact circuit language we know of today that supports
tractable WMC (Darwiche and Marquis 2002).

A d-DNNF is a purely logical construct. It is constructed by compiling the
formula, irrespective of the associated weighting function. Hence a d-DNNF
allows for model counting, but not for WMC. In order to do WMC, we need to
convert the d-DNNF into an arithmetic circuit, by taking into account the
weighting function of our weighted formula. This conversion is done in two steps
(Darwiche 2009): 1) replace all conjunctions in the internal nodes by
multiplications, and all disjunctions by summations, 2) replace every leaf node
involving a literal l by a subtree consisting of a multiplication node having
two children, namely a leaf node with an indicator variable for the literal l
and a leaf node with the weight of l according to the weighted formula. We now
illustrate this for the Alarm example.

Example 13 (d-DNNF and Arithmetic Circuit for the Alarm example) 

We continue the Alarm example (Example 10). The formula for this example, under
the evidence calls(john) = true, is the conjunction of the following three
subformulas.

alarm ↔ burglary ∨ earthquake
calls(john) ↔ alarm ∧ hears_alarm(john)
calls(john)

A corresponding d-DNNF is shown in Figure 1a. Note that the AND-nodes in the
d-DNNF (like the root node) indeed satisfy the property of decomposability,
while the OR-nodes satisfy determinism. The function of the OR-node on the
lower-right is to make the d-DNNF smooth.

The arithmetic circuit corresponding to this d-DNNF is shown in Figure 1b. The
values in brackets in the internal nodes will be used later and can be ignored
for now. The λ-variables in the leaves are the indicator variables for the
literals. The indicator variable for a literal l is multiplied with a number,
which is the weight of l according to our weighting function.

Now that we have an arithmetic circuit for our weighted formula, we are ready to
perform WMC and compute the weighted model count, i.e., the sum of w(ω) over all
ω ∈ SAT(φ). This count is found by simply evaluating the arithmetic circuit: we
instantiate all indicator variables to the value 1 and then bottom-up evaluate
all nodes, until we arrive at the root node. The value found at the root is the
desired weighted model count and also equals the probability of the evidence
P(E = e).

Example 14 (Evaluating the arithmetic circuit for the Alarm example) 
We use the arithmetic circuit for the Alarm program given in Example 13. Recall
that this program and circuit were obtained using calls(john) = true as
the evidence, so we can use this circuit to calculate the probability of
evidence P(calls(john) = true). This is done by instantiating all indicator
variables λ to 1, and then evaluating the circuit. Figure 1b illustrates this:
the obtained values in each node are given between brackets. The value for the
root is 0.196. This is the probability of evidence.
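The bottom-up evaluation can be sketched as follows. The circuit below is a hand-built smooth d-DNNF for the Alarm formula, converted to an arithmetic circuit; it is our own reconstruction for illustration and is not guaranteed to coincide node for node with the circuit in Figure 1. The tuple encoding and the names `leaf` and `evaluate` are likewise assumptions.

```python
import math

def leaf(literal, weight):
    """Leaf subtree: indicator variable of the literal times its weight."""
    return ("leaf", literal, weight)

# calls AND hears_alarm AND alarm AND
#   ( (burglary AND (earthquake OR -earthquake)) OR (-burglary AND earthquake) )
# with AND -> "*", OR -> "+". The inner OR only serves smoothness.
CIRCUIT = ("*", [
    leaf("calls(john)", 1.0),
    leaf("hears_alarm(john)", 0.7),
    leaf("alarm", 1.0),
    ("+", [
        ("*", [leaf("burglary", 0.1),
               ("+", [leaf("earthquake", 0.2), leaf("-earthquake", 0.8)])]),
        ("*", [leaf("-burglary", 0.9), leaf("earthquake", 0.2)]),
    ]),
])

def evaluate(node, indicators):
    """Bottom-up evaluation; indicators not listed default to 1."""
    kind = node[0]
    if kind == "leaf":
        _, literal, weight = node
        return indicators.get(literal, 1.0) * weight
    values = [evaluate(child, indicators) for child in node[1]]
    return math.prod(values) if kind == "*" else sum(values)

p_evidence = evaluate(CIRCUIT, {})   # all indicator variables set to 1
print(round(p_evidence, 3))
```

With all indicator variables at 1, the sum node evaluates to 0.1 · 1 + 0.9 · 0.2 = 0.28 and the root to 0.7 · 0.28 = 0.196.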
[Figure: (a) d-DNNF; (b) arithmetic circuit]

Fig. 1. The d-DNNF for the Alarm example and the corresponding arithmetic
circuit.

The above does not explain why we really need the indicator variables. The
indicator variables allow us to add further evidence, on top of E = e, which is
useful for MARG inference as we will see later. For instance, we can compute
P(E = e ∧ X = true), for some additional atom X in the arithmetic circuit,
by setting the indicator variable λ[X] to 1 and λ[¬X] to 0 when evaluating the
circuit.[13]



[13] In a purely logical context, setting indicator variables to 0 corresponds
to conditioning the d-DNNF circuit.

Example 15 (Evaluating the arithmetic circuit in case of additional evidence)
Assume we want to compute P(calls(john) = true ∧ earthquake = true), using the
same arithmetic circuit seen before, namely the circuit for calls(john) = true.
Since we additionally have earthquake = true, we set λ[earthquake] to 1,
λ[¬earthquake] to 0, and all other indicator variables to 1 as before. The
evaluation is illustrated in Figure 2, yielding the result 0.14. Hence
P(calls(john) = true ∧ earthquake = true) = 0.14.
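The effect of the indicator variables can be reproduced with a compact sketch. Here the arithmetic circuit is written directly as a Python expression over an indicator dictionary; the circuit shape is a hand-built smooth d-DNNF for the Alarm formula (an assumption of ours, not necessarily the circuit of Figure 2), and the "-" prefix for negative literals is an ad hoc encoding.

```python
def circuit(lam):
    """Arithmetic circuit for the Alarm formula; missing indicators default to 1."""
    ind = lambda literal: lam.get(literal, 1.0)
    inner = ind("earthquake") * 0.2 + ind("-earthquake") * 0.8   # smoothing sum
    branches = (ind("burglary") * 0.1 * inner
                + ind("-burglary") * 0.9 * ind("earthquake") * 0.2)
    return (ind("calls(john)") * 1.0) * (ind("hears_alarm(john)") * 0.7) \
        * (ind("alarm") * 1.0) * branches

# Evidence only: every indicator variable is 1.
p_calls = circuit({})

# Additional evidence earthquake = true: switch off the literal -earthquake.
p_calls_and_quake = circuit({"earthquake": 1.0, "-earthquake": 0.0})

print(round(p_calls, 3), round(p_calls_and_quake, 3))
print(round(p_calls_and_quake / p_calls, 4))   # P(earthquake | calls(john))
```

Setting λ[¬earthquake] to 0 removes exactly the models in which earthquake is false, so the root value drops from 0.196 to 0.7 · (0.1 · 0.2 + 0.9 · 0.2) = 0.14.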

In the same way, the probability of any set of evidence can be computed,
provided that this set extends the initial set E = e (and that the additional
atoms also appear in the compiled circuit). This also means that Step 3 of our
conversion algorithm (Section 5.3), where we add the evidence φ_e to the
weighted Boolean formula, is not strictly needed: we can achieve the same result
by using only the formula φ_r (capturing the rules of the program) and setting
the indicator variables in the circuit according to the evidence E = e. However,
asserting the evidence φ_e early makes the compilation phase more efficient (it
allows for more unit propagation, etc.).



[Figure: arithmetic circuit evaluated with additional evidence]

Fig. 2. Evaluating an arithmetic circuit with additional evidence (the nodes
which get a different value than in Figure 1b are highlighted in boldface).



In SRL, the work of Chavira et al. (2006) is closest to the approach in this
section. They perform inference in relational Bayesian networks by encoding
them into a weighted Boolean formula and compiling this formula into an
arithmetic circuit. The main difference is that relational Bayesian networks
are not a programming language and assume acyclicity. That assumption greatly
simplifies the step of converting to a weighted Boolean formula (cf. Section 5).

In summary, to compute the probability of evidence we 1) compile the formula
to a d-DNNF, 2) convert the d-DNNF into an arithmetic circuit, 3) evaluate the
arithmetic circuit.

6.1.2 Compilation to an Arithmetic Circuit via BDD

In the probabilistic logic programming (PLP) community, the state-of-the-art
(De Raedt et al. 2007) is to compile the program into another form, namely a
reduced ordered Binary Decision Diagram (BDD) (Bryant 1986). This approach is a
special case of our above WMC approach (although it is usually not formulated
like that; in fact, in Fierens et al. (2011) we were the first to point out the
connection of the PLP-BDD approach to WMC).

A BDD is a special kind of d-DNNF, namely one that satisfies the additional
properties of ordering and decision, see Darwiche (2004). In our approach, we
can alternatively replace the d-DNNF compiler by a BDD compiler. Computing the
probability of evidence can then be done by either operating directly on the
BDD, or by converting the BDD to an arithmetic circuit and evaluating the
circuit (the first approach is merely a reformulation of the second). So while
both compilation to BDD and d-DNNF are possible, there is theoretical and
empirical evidence in the model counting literature that d-DNNFs outperform
BDDs (Darwiche 2004). Our experimental results confirm the superiority of
d-DNNFs (Section 9).

We have now seen two ways of computing the probability of evidence: via d- 
DNNFs or BDDs. We will now see how this approach for computing the probability 
of evidence can be used as a building block for the MARG inference task (as is 
standard in the probabilistic literature). 



6.2 Task 2: Computing marginal probabilities (MARG) 

In MARG, we are given a set of query atoms $\mathbf{Q}$ and for each $Q \in \mathbf{Q}$ we need to
compute $P(Q \mid E = e)$. By definition, $P(Q \mid E = e) = \frac{P(Q \wedge E = e)}{P(E = e)}$. Hence, if we have
N atoms in the query set $\mathbf{Q}$, solving MARG reduces to computing the probability
of the evidence, and computing N probabilities of the form $P(Q \wedge E = e)$, i.e., the
probability of the conjunction of the evidence with a single atom. In the previous
section, we have already seen how we can compute such probabilities from the
compiled arithmetic circuit, by appropriately instantiating the indicator variables
$\lambda$ and evaluating the circuit. The simplest approach is to apply this once for each
query atom $Q \in \mathbf{Q}$ separately. However, we can solve this even more efficiently.
Concretely, all required probabilities can be found in parallel. To be precise, all
probabilities of the form $P(X \wedge E = e)$, with X being any atom in the circuit,
can be computed simultaneously by traversing the circuit twice (bottom-up and
top-down). The required traversal algorithm can be found in the literature, see
Algorithm 34 (simple version) and 35 (optimized version) in Darwiche (2009). From
this, we obtain all probabilities of the form $P(X \wedge E = e)$. We then retain those that
involve an atom from the query set ($X \in \mathbf{Q}$) and compute the required conditional
probabilities $P(Q \mid E = e)$ as $\frac{P(Q \wedge E = e)}{P(E = e)}$. As in the previous section, this entire
approach can be performed using an arithmetic circuit derived from a compiled
d-DNNF or from a BDD.
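The two-pass computation can be illustrated on a toy circuit. The sketch below is not the ProbLog2 code: it hand-builds a smooth d-DNNF for the single constraint alarm <-> (burglary v earthquake), uses hypothetical fact probabilities (0.1 and 0.2), and assumes the nodes are listed with children before parents. The bottom-up pass evaluates the circuit under the evidence; the top-down pass propagates partial derivatives, from which every $P(X \wedge E = e)$ is read off at the leaves.

```python
from math import prod

# Hypothetical weights; the derived atom `alarm` has weight 1 for both signs.
WEIGHT = {
    ("burglary", True): 0.1, ("burglary", False): 0.9,
    ("earthquake", True): 0.2, ("earthquake", False): 0.8,
    ("alarm", True): 1.0, ("alarm", False): 1.0,
}

# Smooth d-DNNF for alarm <-> (burglary v earthquake), children before parents.
# ("lit", atom, sign) is a leaf; ("+", kids) and ("*", kids) are inner nodes.
CIRCUIT = [
    ("lit", "earthquake", True),    # 0
    ("lit", "earthquake", False),   # 1
    ("+", [0, 1]),                  # 2
    ("lit", "burglary", True),      # 3
    ("*", [3, 2]),                  # 4
    ("lit", "burglary", False),     # 5
    ("*", [5, 0]),                  # 6
    ("+", [4, 6]),                  # 7
    ("lit", "alarm", True),         # 8
    ("*", [8, 7]),                  # 9
    ("lit", "alarm", False),        # 10
    ("*", [10, 5, 1]),              # 11
    ("+", [9, 11]),                 # 12 (root)
]


def joint_with_evidence(circuit, weight, evidence):
    """Return P(E=e) and a dict (atom, sign) -> P(atom=sign and E=e)."""
    n = len(circuit)
    val = [0.0] * n
    # Bottom-up pass: leaves contradicting the evidence get indicator 0.
    for i, node in enumerate(circuit):
        if node[0] == "lit":
            _, atom, sign = node
            indicator = 0.0 if evidence.get(atom, sign) != sign else 1.0
            val[i] = indicator * weight[(atom, sign)]
        elif node[0] == "+":
            val[i] = sum(val[c] for c in node[1])
        else:
            val[i] = prod(val[c] for c in node[1])
    # Top-down pass: dr[i] is the derivative of the root value w.r.t. node i.
    dr = [0.0] * n
    dr[-1] = 1.0
    for i in range(n - 1, -1, -1):
        node = circuit[i]
        if node[0] == "+":
            for c in node[1]:
                dr[c] += dr[i]
        elif node[0] == "*":
            for c in node[1]:
                others = prod(val[o] for o in node[1] if o != c)
                dr[c] += dr[i] * others
    # Read off the joint probabilities at the leaves.
    joint = {}
    for i, node in enumerate(circuit):
        if node[0] == "lit":
            key = (node[1], node[2])
            joint[key] = joint.get(key, 0.0) + dr[i] * val[i]
    return val[-1], joint


p_evidence, joint = joint_with_evidence(CIRCUIT, WEIGHT, {"alarm": True})
marginals = {a: joint[(a, True)] / p_evidence for a in ("burglary", "earthquake")}
print(p_evidence, marginals)
```

Both marginals come out of a single bottom-up and a single top-down traversal; only the final division by the probability of evidence is done per query atom.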

The knowledge compilation approach is typically used for exact inference. When
dealing with large domains, we often need to resort to computing approximate
marginals. Approximate inference is often achieved by means of sampling
techniques, such as Markov Chain Monte Carlo (MCMC). Standard MCMC approaches
like Gibbs sampling cannot deal with weighted formulas because the formula itself
is deterministic. Instead, we use the MC-SAT algorithm that was developed
specifically to deal with determinism (Poon and Domingos 2006). MC-SAT is an MCMC
algorithm that in every step of the Markov chain calls a SAT solver to construct a
new sample. MC-SAT takes an MLN as input. Theorem 3 ensures that if we apply
MC-SAT on the appropriate MLN, we indeed obtain samples from the distribution
P(Q | E = e).

To summarize, we currently have three methods for the MARG task: exact
inference by compilation to 1) d-DNNFs or 2) BDDs, or 3) approximate inference with
MC-SAT.



6.3 Task 3: Finding the most likely explanation (MPE) 

MPE is the task of finding the most likely interpretation (joint state) of all
unobserved atoms given the evidence, i.e., finding $\arg\max_{u} P(U = u \mid E = e)$, with
U all unobserved atoms (i.e., all atoms in the ground program that are not in E).
MPE inference on weighted formulas has been studied before. We consider two
approaches.

The first approach is to perform MPE by means of knowledge compilation. The
compilation step (to compile an arithmetic circuit via a d-DNNF or BDD) is the
same as before, only the traversal step differs [14]. Again, the traversal algorithm can
be found in the literature, see Algorithm 36 in Darwiche (2009). This yields the
exact MPE solution [15].
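To make the MPE task itself concrete, the following sketch finds the most likely joint state by exhaustive enumeration of total choices. It is a stand-in for, not an implementation of, the circuit traversal referred to above, and it is only feasible for tiny programs. The fact probabilities and the restriction to a single person (john) are assumptions made for illustration.

```python
from itertools import product

# Hypothetical probabilities for a one-person version of the Alarm program.
FACTS = {"burglary": 0.1, "earthquake": 0.2, "hears_alarm(john)": 0.7}


def model(choice):
    """Derive the truth values of all atoms from a total choice."""
    world = dict(choice)
    world["alarm"] = choice["burglary"] or choice["earthquake"]
    world["calls(john)"] = world["alarm"] and choice["hears_alarm(john)"]
    return world


def mpe(evidence):
    """Return the most likely world consistent with the evidence and its joint probability."""
    names = list(FACTS)
    best_world, best_p = None, 0.0
    for values in product([True, False], repeat=len(names)):
        choice = dict(zip(names, values))
        world = model(choice)
        if any(world[atom] != value for atom, value in evidence.items()):
            continue  # inconsistent with the evidence
        p = 1.0
        for name in names:
            p *= FACTS[name] if choice[name] else 1.0 - FACTS[name]
        if p > best_p:
            best_world, best_p = world, p
    return best_world, best_p


world, p = mpe({"calls(john)": True})
print(world, p)
```

Maximizing the joint probability $P(U = u \wedge E = e)$ is equivalent to maximizing the conditional $P(U = u \mid E = e)$, since the two differ only by the constant factor $P(E = e)$.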



The second approach is to perform MPE using techniques from the SAT solving
community. Concretely, it is known that MPE reduces to partially weighted
MAX-SAT (Park 2002). A popular approximate approach for solving this task is stochastic
local search (Park 2002). An example algorithm is MaxWalkSAT, which is also the
standard MPE algorithm for MLNs (Domingos et al. 2008).



Since our current ProbLog implementation focusses on MARG inference rather 
than MPE, we do not discuss these approaches in detail and will not consider them 
further in this paper. 



14 For the MPE task, it is sufficient to compile into a DNNF circuit, which is not necessarily deter- 
ministic. DNNF circuits are potentially more succinct than d-DNNF circuits, but unfortunately 
there exist no compilers specifically for DNNF. 

15 This approach yields the truth value of all ground atoms that occur in the relevant ground 
program (RGP) for the given evidence. All probabilistic atoms that do not occur in the RGP are 
irrelevant w.r.t. the evidence (i.e., they are probabilistically independent of the evidence). Hence, 
for each of these atoms, we can simply independently choose the truth value with maximum
probability according to the associated probabilistic fact. The truth value of all derived atoms 
that do not occur in the RGP is then found by computing the well-founded model of the MPE 
total choice. 
7 Learning Probabilistic Logic Programs from Partial Interpretations

We now present an algorithm for learning the parameters (the probabilities of the
probabilistic facts) of a ProbLog program from data. We use the learning from
interpretations (LFI) setting.

7.1 The Learning Setting 

Learning from (possibly partial) interpretations is a common setting in statistical
relational learning, which has so far not yet been studied in its full generality for
probabilistic programming languages (but see also Gutmann et al. (2011)).

In the terminology used for inference in Section 6, partial interpretations
correspond to evidence, and hence, in this section we shall often use the term evidence
instead of partial interpretation. Let At be the Herbrand base, i.e., the set of all
ground (probabilistic and derived) atoms in a given ProbLog program. In the fully
observable case, we learn from a set of complete interpretations, that is, the observed
truth values e of all the atoms in the Herbrand base At are given and the evidence
variables E coincide with At. On the other hand, in the partially observable case,
we learn from a set of partial interpretations, that is, we only observe the truth
values e of a set $E \subset At$ of observed atoms. We now develop an algorithm, called
LFI-ProbLog, that learns from (possibly partial) interpretations of a ProbLog
program. In a generative setting, one is typically interested in the maximum likelihood
parameters given the training data. This can be formalized as follows.

Given:

• a ProbLog program $T_p$ containing a set of rules $R$ and a set of probabilistic
  facts $F = \{p_i :: f_i\}$ with unknown parameters $p = (p_1, \ldots, p_N)$
• a set of (possibly partial) interpretations $D = \{E_1 = e_1, \ldots, E_M = e_M\}$
  (the training examples)

Find: the maximum likelihood probabilities $\hat{p} = (\hat{p}_1, \ldots, \hat{p}_N)$, that is,

$$\hat{p} = \arg\max_{p} P_{T_p}(D) = \arg\max_{p} \prod_{m=1}^{M} P_{T_p}(E_m = e_m)$$

where $P_{T_p}(E_m = e_m)$ is the probability of evidence $E_m = e_m$ in the ProbLog
program $T_p$ with parameters $p$.

Example 16 illustrates the LFI setting using the Alarm program from Example 1.

Example 16 (Learning From Interpretations)

P1::burglary.                    person(mary).    alarm :- burglary.
P2::earthquake.                  person(john).    alarm :- earthquake.
P3::hears_alarm(X) :- person(X).                  calls(X) :- alarm, hears_alarm(X).

A ProbLog program is given in which the probabilities P1, P2 and P3 are
unknown and should be learned from partial interpretations, which contain the truth
value for some of the atoms: {alarm = true}, {earthquake = true, calls(mary) =
true}, {calls(john) = true}. The goal is to find the probabilities P1, P2 and P3 such
that the combined probability of the partial interpretations is maximal.
One has to consider two cases when computing the maximum likelihood
parameters $\hat{p}$. In the fully observable case, where the truth values of all atoms in
the Herbrand base are known, one can obtain $\hat{p}$ by counting. In the more complex
case of partial interpretations, one has to use an approach such as Expectation
Maximization to deal with the partial observability.



7.2 Full Observability 

In the fully observable case, the maximum likelihood estimate $\hat{p}_n$ for a probabilistic
fact $p_n :: f_n$ can be calculated directly from the interpretations. When $p_n :: f_n$
is intensional, it represents multiple ground instances, that is, probabilistic facts
$p_n :: f_{n,1}, \ldots, p_n :: f_{n,K_n^m}$, where $K_n^m$ is the number of ground instances represented
by the fact $p_n :: f_n$ in interpretation $E_m = e_m$. When $p_n :: f_n$ is ground and
extensional, $K_n^m$ is equal to 1 and the fact represents itself only. The maximum
likelihood estimates can be calculated using the following formula.

$$\hat{p}_n = \frac{1}{Z_n} \sum_{m=1}^{M} \sum_{k=1}^{K_n^m} \delta_{n,k}^{m}
\quad \text{with} \quad
\delta_{n,k}^{m} = \begin{cases} 1 & \text{if } f_{n,k} = \text{true} \in E_m = e_m \\ 0 & \text{if } f_{n,k} = \text{false} \in E_m = e_m \end{cases}
\qquad (1)$$

The sum is normalized by $Z_n = \sum_{m=1}^{M} K_n^m$, the total number of ground probabilistic
facts represented by $f_n$ in all training examples. When $Z_n$ is zero, $\hat{p}_n$ is
not calculated (there is no data to estimate it from).
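As a small illustration of Equation (1), the sketch below counts true ground instances per probabilistic fact and normalizes by the total number of ground instances. The data layout (one dictionary per complete interpretation, mapping a fact name to the truth values of its ground instances) and the example values are assumptions made for this illustration.

```python
def ml_estimates(interpretations):
    """Maximum likelihood estimates by counting, following Eq. (1).

    `interpretations` is a list of dicts; each maps a probabilistic fact
    name to the list of truth values of its ground instances.
    """
    true_counts, totals = {}, {}
    for interpretation in interpretations:
        for fact, instances in interpretation.items():
            true_counts[fact] = true_counts.get(fact, 0) + sum(instances)
            totals[fact] = totals.get(fact, 0) + len(instances)  # adds K_n^m to Z_n
    # Facts with Z_n = 0 are skipped: there is no data to estimate them from.
    return {fact: true_counts[fact] / totals[fact] for fact in totals if totals[fact] > 0}


# Hypothetical complete interpretations of the Alarm program (two persons).
data = [
    {"burglary": [False], "earthquake": [True], "hears_alarm": [True, False]},
    {"burglary": [True], "earthquake": [False], "hears_alarm": [True, True]},
    {"burglary": [False], "earthquake": [False], "hears_alarm": [False, True]},
    {"burglary": [False], "earthquake": [True], "hears_alarm": [True, True]},
]
estimates = ml_estimates(data)
print(estimates)
```

Here the intensional fact hears_alarm has two ground instances per interpretation, so its normalizer is 8, while burglary and earthquake each have a normalizer of 4.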



7.3 Partial Observability

In many applications the training examples are only partially observed. In the
alarm example, we may receive a phone call but we may not know whether an
earthquake has occurred. In the partially observable case, as for Bayesian
networks, it is impossible to compute the maximum likelihood estimates in closed
form. Instead, we use the Expectation Maximization (EM) algorithm, see Algorithm 1. In this
algorithm, the parameters $p_n$ are initialized randomly. During each iteration $i$, the
ProbLog program $T_{p^i}$ with parameters $p^i$ is used to estimate the probability of
the unobserved atoms being true in each interpretation, $P_{T_{p^i}}(f_{n,k} \mid E_m = e_m)$ (the
expectation step). These expectations are then used to update the parameters
of the program using the following equation (the maximization step).

$$p_n^{i+1} = \frac{1}{Z_n} \sum_{m=1}^{M} \sum_{k=1}^{K_n^m} P_{T_{p^i}}(f_{n,k} \mid E_m = e_m) \qquad (2)$$

Algorithm 1 uses the inference mechanism described in Section 6.2 for computing
the marginals in the expectation step. We can make two optimizations. Firstly, for
the facts $f_{n,k}$ that are not contained in the dependency set of a partial interpretation
$E_m = e_m$, the probability $P_{T_{p^i}}(f_{n,k} \mid E_m = e_m)$ is equal to $p_n^i$. These facts slow
down the updating process and should therefore not be included in the sum. This
can be realized by compiling the d-DNNF for the query $P_{T_{p^i}}(E_m = e_m)$ and using
the resulting d-DNNFs to compute the marginal probabilities $P_{T_{p^i}}(f_{n,k} \mid E_m = e_m)$
only for those facts $f_{n,k}$ included in the d-DNNF. For example, when we compile the
d-DNNF for the third partial interpretation of Example 16, we obtain the ground
program from Example 5 and the d-DNNF from Figure 1. This d-DNNF does not
contain calls(mary), so this atom will not be used to update the probabilities for
the third partial interpretation. When no groundings for a learnable fact are present
in any of the d-DNNFs, a zero probability is learned as no information is given.
Secondly, one can observe that changing the parameters of a ProbLog program
does not change the structure of the compiled d-DNNFs. This means that the
d-DNNFs that have been compiled in the first iteration can be reused in all further
iterations. The algorithm keeps on updating the parameters until the log-likelihood
of the interpretations converges (in general to a local maximum). Each iteration of
the algorithm is guaranteed not to decrease the likelihood of the data.

Algorithm 1 The main loop of LFI-ProbLog. The ProbLog program is compiled
into a d-DNNF for each partial interpretation $E_m = e_m$. After the compilation
step, the algorithm follows an EM update scheme, first using the current model to
complete the data and then estimating the new model parameters from the resulting
counts until convergence.
 1: function LFI-ProbLog($T = \{p_1 :: f_1, \ldots, p_N :: f_N\} \cup R$, $D = \{E_1 = e_1, \ldots, E_M = e_M\}$)
 2:   for $1 \le n \le N$ do
 3:     $p_n^0 \leftarrow$ rand(0, 1)    > The fact probabilities are initialized with a random probability
 4:   for $1 \le m \le M$ do    > Loop over training examples
 5:     d-DNNF$_m \leftarrow$ COMPILE($P_{T_0}(E_m = e_m)$)
 6:   $i \leftarrow 0$
 7:   while not converged do    > EM algorithm
 8:     $i \leftarrow i + 1$
 9:     for $1 \le m \le M$ do
10:       for $1 \le n \le N$ do
11:         for $1 \le k \le K_n^m$ do
12:           compute $P_{T_{i-1}}(f_{n,k} \mid E_m)$ using d-DNNF$_m$    > E Step
13:     for $1 \le n \le N$ do    > Loop over probabilistic facts
14:       $p_n^i \leftarrow \frac{1}{Z_n} \sum_{m=1}^{M} \sum_{k=1}^{K_n^m} P_{T_{i-1}}(f_{n,k} \mid E_m)$    > M Step (cf. Eq. 2)
15:   return $\{p_n^i :: f_n \mid f_n \in F\} \cup R$
The learning algorithm uses a black box for the MARG inference task (line 12). In
principle, any inference algorithm will work, including approximate ones. However,
by choosing knowledge compilation for inference, we need to compile a circuit only
once for each training example. This is the hard task. Once we have a circuit,
computing expectations becomes easy, and we can reuse the circuit many times,
for all i, k and n in lines 6, 10 and 11 of Algorithm 1. Furthermore, all marginal
probabilities $P_{T_p}(f_{n,k} \mid E_m)$ for the same evidence set $E_m$ and parameterization $p$
can be computed at once, in only two passes over the d-DNNF circuit.
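The EM loop can be sketched in a few lines on the Alarm program of Example 16. This is only an illustration under several assumptions: the MARG black box is replaced by exhaustive enumeration of total choices rather than d-DNNF compilation, the parameters start at 0.5 instead of random values so that the run is reproducible, all ground facts are included in the sum (no dependency-set pruning), and a fixed number of iterations replaces the convergence test. All function and variable names are hypothetical.

```python
from itertools import product

PEOPLE = ["mary", "john"]
# Each learnable (possibly intensional) fact and its ground instances.
GROUND = {
    "burglary": ["burglary"],
    "earthquake": ["earthquake"],
    "hears_alarm": [f"hears_alarm({x})" for x in PEOPLE],
}
ATOM_TO_FACT = {g: f for f, instances in GROUND.items() for g in instances}


def model(choice):
    """Derive alarm and calls/1 from a total choice over the ground facts."""
    world = dict(choice)
    world["alarm"] = choice["burglary"] or choice["earthquake"]
    for x in PEOPLE:
        world[f"calls({x})"] = world["alarm"] and choice[f"hears_alarm({x})"]
    return world


def expectations(params, evidence):
    """Return P(E=e) and P(f=true | E=e) for every ground fact (by enumeration)."""
    atoms = list(ATOM_TO_FACT)
    p_e = 0.0
    joint = dict.fromkeys(atoms, 0.0)
    for values in product([True, False], repeat=len(atoms)):
        choice = dict(zip(atoms, values))
        world = model(choice)
        if any(world[a] != v for a, v in evidence.items()):
            continue
        w = 1.0
        for a in atoms:
            p = params[ATOM_TO_FACT[a]]
            w *= p if choice[a] else 1.0 - p
        p_e += w
        for a in atoms:
            if choice[a]:
                joint[a] += w
    return p_e, {a: joint[a] / p_e for a in atoms}


def lfi(data, params, iterations=50):
    """EM: returns the final parameters and the likelihood before each update."""
    likelihoods = []
    for _ in range(iterations):
        sums = dict.fromkeys(GROUND, 0.0)
        likelihood = 1.0
        for evidence in data:                       # E step
            p_e, posterior = expectations(params, evidence)
            likelihood *= p_e
            for atom, p in posterior.items():
                sums[ATOM_TO_FACT[atom]] += p
        likelihoods.append(likelihood)
        # M step (Eq. 2): Z_n = number of ground instances times number of examples.
        params = {f: sums[f] / (len(GROUND[f]) * len(data)) for f in GROUND}
    return params, likelihoods


DATA = [
    {"alarm": True},
    {"earthquake": True, "calls(mary)": True},
    {"calls(john)": True},
]
params, likelihoods = lfi(DATA, {"burglary": 0.5, "earthquake": 0.5, "hears_alarm": 0.5})
print(params, likelihoods[0], likelihoods[-1])
```

The recorded likelihoods never decrease from one iteration to the next, which is the EM guarantee the text refers to. In a compilation-based implementation, the call to `expectations` is where the reusable circuit would be evaluated.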

7.4 Discussion

The above learning algorithm is based on the LFI algorithm that we developed
in earlier work (Gutmann et al. 2011), see the discussion in Section 1. Our new
algorithm has some advantages over the earlier version. First, the new algorithm
can deal with cyclic programs (the old algorithm uses Clark's completion, which
applies only to acyclic programs, cf. Section 5.2). Second, the new algorithm scales
better as it employs a more efficient approach for inference in the expectation
step, namely compilation to d-DNNFs instead of to BDDs (see the experiments
in Section 9.3.3). Furthermore, the new description of the algorithm more clearly
separates the learning from the inference steps.

The complexity of parameter learning (and of MARG and MPE inference) is
worst-case exponential in the treewidth (Robertson and Seymour 1986) of the
weighted Boolean formula when using knowledge compilation to d-DNNF
(Darwiche 2001). This theoretical complexity bound is in line with the complexity of
classical algorithms for inference and learning in probabilistic graphical models. For
example, hidden Markov models have a constant treewidth in terms of the number
of time steps considered. Learning the parameters of these models is linear in the
number of time steps, both when using LFI-ProbLog with d-DNNF compilation
and, for example, expectation maximization with the classical junction tree
algorithm. These bounds assume that both algorithms succeed at finding the optimal
tree decomposition of the model, which itself is a hard task in theory. In practice,
however, there exist heuristics that can find good tree decompositions of many
different kinds of models.



8 Implementation of the System ProbLog2 



The first ProbLog system (Kimmig et al. 2010) focused on the inference task of
computing the success probability of a single atom (Section El) and on learning from
entailment (Gutmann et al. 2008b). ProbLog2, the new ProbLog system described
in this paper, focusses on different tasks, namely computing marginal probabilities
and the probability of evidence, as well as learning from interpretations. This new
setting is closer in spirit to the work on graphical models and Statistical Relational
Learning (like Markov Logic). As a consequence of this new emphasis, the design
of ProbLog2 is quite different from that of the first ProbLog. The implementation
of the first ProbLog was tightly integrated in YAP Prolog (Kimmig et al. 2010). In
contrast, ProbLog2 consists of a number of relatively loosely-coupled components,
and involves almost no Prolog code. This new design is closer in spirit to that of
some Answer Set Programming systems than to Prolog.

We now briefly discuss the different components of the implementation. Most of 
these components are existing state-of-the-art programs, rather than being tailor- 
made for ProbLog. 
• The grounding component computes the relevant ground program from the
  given ProbLog program (and query and evidence). This is the only component
  that is written in (YAP) Prolog. It is essentially a meta-interpreter that
  collects proofs to construct the dependency set (Section 5.1).

• The conversion component converts the rules in the relevant ground program
  to a Boolean (CNF) formula. The user can choose between the proof-based
  and the rule-based conversion (Section 5.2). For the proof-based conversion
  (Mantadelis and Janssens 2010), we use our own implementation. For the
  rule-based conversion, we use the code of Janhunen (2004), as used in the
  Answer Set Programming community.

• The exact inference component is based on knowledge compilation and
  consists of two parts: a compiler and an evaluation algorithm. For compilation to
  d-DNNF, the user can choose between the 'c2d' compiler by Darwiche (2004)
  or the more recent 'DSHARP' compiler (Muise et al. 2012) [16]. For compilation
  to a BDD, we use CUDD (see http://vlsi.colorado.edu/~fabio/CUDD/).
  For constructing and evaluating the corresponding arithmetic circuit we use
  our own code.

• The approximate inference component converts the weighted formula to a
  Markov Logic Network and then uses the MC-SAT (Poon and Domingos
  2006) code from the Alchemy package to perform the sampling.

• The learning component, LFI-ProbLog, builds heavily on the inference
  component, as explained before (Section 7). It is essentially an Expectation
  Maximization loop around the inference component.

As mentioned, the above components are relatively loosely-coupled. They are
bundled into a pipeline by means of Python code. A major advantage of our design is
that it allows us to build an entire ProbLog system by (mostly) using existing
state-of-the-art programs for the different components, such as Janhunen's conversion
program and the various d-DNNF and BDD compilers. Moreover, research on
conversion of logic programs, knowledge compilation, weighted model counting, etc.,
continues, with new tools being released. Whenever a new tool becomes available
for a particular component, we can benefit from this and integrate it into our
system. Such a design of course also has drawbacks. The two main drawbacks are that
there is a certain latency between the components because of I/O issues, and that
the system is complex to install and configure because of the different components
written in different programming languages.

ProbLog2 is available on http://dtai.cs.kuleuven.be/problog/ [17].



16 All experiments in this paper use c2d. 

17 In addition to the MARG, MPE and learning tasks, the ProbLog2 system supports MAP and
decision-theoretic inference (Van den Broeck et al. 2010), which are not discussed here.
9 Experiments

The goal of our experiments is to establish the feasibility of our approach, and to 
analyze the influence of the different parameters. We focus on MARG inference and 
learning. Concretely, we aim to answer six questions. 

Q1 Is working with the relevant rather than the complete ground program more
   efficient?
Q2 Which of the two considered algorithms for converting the ground program
   to a Boolean formula (rule-based or proof-based conversion) performs best?
Q3 Which of the two considered approaches for knowledge compilation (using
   d-DNNFs or BDDs) performs best?
Q4 When computing success probabilities (the 'classical' ProbLog setting), does
   ProbLog2 outperform the previous ProbLog implementation?
Q5 When learning from data generated from a known ProbLog program, can we
   recover the parameters of the original program given a reasonable amount of
   data?
Q6 When learning from real-world data, can we obtain results comparable to the
   ones obtained with a state-of-the-art system (namely Alchemy)?

Note that in Q6 we compare our system to Alchemy, which is the standard system 
for Markov Logic (see http://alchemy.cs.washington.edu/). 

9.1 Programs and Datasets 

We perform experiments on three types of applications. 



Social networks. We use the standard Smokers application (Domingos et al. 2008).
The ProbLog program contains the following intensional probabilistic facts and
rules.

0.2::stress(P) :- person(P).
0.3::influences(P1,P2) :- friend(P1,P2).
0.1::cancer_spont(P) :- person(P).
0.3::cancer_smoke(P) :- person(P).

smokes(X) :- stress(X).
smokes(X) :- smokes(Y), influences(Y,X).
cancer(P) :- cancer_spont(P).
cancer(P) :- smokes(P), cancer_smoke(P).

In addition, the program contains ground (non-probabilistic) facts for the predicates
person/1 and friend/2. The number of such facts is varied; see the next section.

Collective classification. We use the relational WebKB dataset (see
http://www.cs.cmu.edu/~webkb/). In WebKB, university web pages need to be
tagged with classes (like course page, student page, etc.). This is modeled with a
predicate has_class(Page, Class). The rules in the ProbLog program are the
following; they specify how the class C of a page P depends on the textual content
of P (the words W on the page), and on the classes of pages that link to P.

has_class(P,C) :- word_class(P,W,C).
has_class(P,C) :- has_class(P2,C2), link_class(P,P2,C,C2).

For the predicate word_class/3, there is a different intensional probabilistic fact
for every (word,class)-pair in the dataset. Each such intensional probabilistic fact
looks as follows.

prob::word_class(P,word1,class1) :- has_word(P,word1).

The reason why we need a different intensional probabilistic fact for each
(word,class)-pair is that for every such pair the involved probability (prob) can be
different. Similarly, for the predicate link_class/4, there is one intensional
probabilistic fact for every pair of classes in the dataset.

prob::link_class(P,P2,class1,class2) :- links_to(P,P2).

The predicates that occur in the 'bodies' of these intensional probabilistic facts
(has_word/2 and links_to/2) are defined in the dataset. The probabilities of the
probabilistic facts were learned from data using LFI-ProbLog.

Probabilistic grids. For comparing ProbLog2 to the previous ProbLog
implementation, we use the classical ProbLog application of probabilistic graphs (De Raedt
et al. 2007). The program represents a graph in which edges are labelled with a
probability. Here we use a 16 x 16 grid as the graph. This consists of nodes $n_{x,y}$, with
$x, y \in \{1, \ldots, 16\}$, laid out on a square grid with horizontal, vertical and diagonal
directed edges between adjacent nodes. Concretely, the edges are the following.

$$
\begin{array}{lll}
n_{x,y} \rightarrow n_{x+1,y} & \forall x \in \{1, \ldots, 15\},\ y \in \{1, \ldots, 16\} & \text{(horizontal)} \\
n_{x,y} \rightarrow n_{x,y+1} & \forall x \in \{1, \ldots, 16\},\ y \in \{1, \ldots, 15\} & \text{(vertical)} \\
n_{x,y} \rightarrow n_{x+1,y+1} & \forall x \in \{1, \ldots, 15\},\ y \in \{1, \ldots, 15\} & \text{(diagonal)}
\end{array}
$$

Every edge has probability 0.5. Such a probabilistic graph is modelled in ProbLog
by a set of probabilistic edge/2 facts. For instance, the horizontal edge from $n_{1,1}$
to $n_{2,1}$ is represented as the probabilistic fact 0.5::edge(n_1_1,n_2_1). The goal
is to find the probability of there being a path between certain nodes in the graph,
where path is defined in the usual Prolog way.

path(X,Y) :- edge(X,Y).
path(X,Y) :- edge(X,Z), path(Z,Y).
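The edge set described above is easy to generate mechanically. The following sketch enumerates the horizontal, vertical and diagonal edges of the grid and prints them as probabilistic edge/2 facts; the helper names `grid_edges` and `to_problog` are hypothetical and not part of any ProbLog distribution.

```python
def grid_edges(size=16):
    """Directed edges of the size x size grid as ((x, y), (x', y')) pairs."""
    edges = []
    for x in range(1, size + 1):
        for y in range(1, size + 1):
            if x < size:
                edges.append(((x, y), (x + 1, y)))        # horizontal
            if y < size:
                edges.append(((x, y), (x, y + 1)))        # vertical
            if x < size and y < size:
                edges.append(((x, y), (x + 1, y + 1)))    # diagonal
    return edges


def to_problog(edges, prob=0.5):
    """Render the edges as probabilistic edge/2 facts."""
    return [f"{prob}::edge(n_{a}_{b},n_{c}_{d})." for (a, b), (c, d) in edges]


edges = grid_edges(16)
facts = to_problog(edges)
print(len(facts))
print(facts[0])
```

For the 16 x 16 grid this yields 15·16 horizontal, 16·15 vertical and 15·15 diagonal edges, i.e. 705 probabilistic facts in total.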



9.2 Experimental Setup 

We now describe how we use these three programs (Smokers, WebKB and grids) in 
our experimental setup. 
9.2.1 Inference Setup

MARG inference. We test MARG inference on Smokers and WebKB. The main
parameter influencing the complexity of our inference experiments is the 'domain 
size', i.e., the number of constants considered. For Smokers, the domain size is the 
number of people; for WebKB, it is the number of webpages (we take subsets of all 
pages that occur in the dataset). In our experiments, we vary the domain size and 
see how different measures, such as runtime, evolve. For each considered domain 
size we generate multiple different instances of the MARG task (10 for Smokers, 
8 for WebKB), as described below. We report median results over these different 
instances (we use median because it is more stable than arithmetic average). 

Given a particular domain size, one instance of the MARG task is generated as
follows. (1) For both Smokers and WebKB, the program involves ground
(non-probabilistic) facts for certain 'background' predicates. We first generate
interpretations for these predicates. For Smokers, the background predicate is friend/2,
which determines the actual social network. We use a generator of synthetic power
law random graphs (since such graphs are known to resemble real social networks)
and convert the obtained graph to friend/2 facts. For WebKB, the background
predicates are has_word/2 and links_to/2, for which interpretations can be found
in the dataset. (2) Given the domains and background facts, we select the set of
query and evidence atoms, Q and E. For Smokers, we use 50% of all smokes/1
and cancer/1 atoms as evidence and the other smokes/1 and cancer/1 atoms as
queries. All atoms for the other predicates are neither query nor evidence. For
WebKB, we have a similar setup: we use 50% of all has_class/2 atoms as evidence
and the other has_class/2 atoms as query. (3) The previous step generates the sets
Q and E, but not yet the values for the evidence atoms, i.e., the vector of truth
values e. To do so, we generate a 'sample' of the ProbLog program. This is done
by independently sampling each ground probabilistic fact, and then computing the
well-founded model of the resulting logic program (as dictated by the ProbLog
semantics). The result is a complete interpretation of all predicates in the program.
From this interpretation, we extract the truth values of all atoms in the evidence
set E, and we use these truth values to construct the vector e. (We similarly store
the values of all atoms in the query set Q because we need them later as 'query
ground truth'; see Section 9.3.2.)
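Steps (2) and (3) can be sketched for the Smokers program as follows. This is an illustrative reimplementation in Python, not the code used in the experiments: the friend graph is a small hand-written list rather than a power-law random graph, and the function names are hypothetical. Because the Smokers program contains no negation, its well-founded model coincides with the least model, which the sketch computes by a simple fixpoint iteration.

```python
import random


def sample_smokers(people, friends, rng):
    """Sample every ground probabilistic fact, then derive smokes/1 and cancer/1."""
    world = {
        "stress": {p for p in people if rng.random() < 0.2},
        "influences": {(a, b) for (a, b) in friends if rng.random() < 0.3},
        "cancer_spont": {p for p in people if rng.random() < 0.1},
        "cancer_smoke": {p for p in people if rng.random() < 0.3},
    }
    # Least fixpoint of: smokes(X) :- stress(X).  smokes(X) :- smokes(Y), influences(Y,X).
    smokes = set(world["stress"])
    changed = True
    while changed:
        changed = False
        for (y, x) in world["influences"]:
            if y in smokes and x not in smokes:
                smokes.add(x)
                changed = True
    world["smokes"] = smokes
    world["cancer"] = world["cancer_spont"] | (smokes & world["cancer_smoke"])
    return world


def split_evidence(world, people, rng):
    """Use half of the smokes/1 and cancer/1 atoms as evidence, the rest as queries."""
    atoms = [(pred, p) for pred in ("smokes", "cancer") for p in people]
    rng.shuffle(atoms)
    half = len(atoms) // 2
    truth = {(pred, p): p in world[pred] for pred, p in atoms}
    evidence = {a: truth[a] for a in atoms[:half]}
    query_truth = {a: truth[a] for a in atoms[half:]}
    return evidence, query_truth


rng = random.Random(0)
people = ["p1", "p2", "p3", "p4"]
friends = [("p1", "p2"), ("p2", "p3"), ("p3", "p1"), ("p3", "p4")]  # hypothetical graph
world = sample_smokers(people, friends, rng)
evidence, query_truth = split_evidence(world, people, rng)
print(evidence, query_truth)
```

The stored `query_truth` plays the role of the 'query ground truth' used later to score approximate marginals.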



Special case: success probability. The above is for MARG inference in the presence
of multiple queries and evidence. In addition we also perform an experiment in the
classical success probability setting, where the goal is to compute the probability
of a single query, without evidence (Kimmig et al. 2010). For this experiment, we
use the probabilistic grid program. Per experiment, we ask a single query of the
form path(n_i_i,n_16_16) where i is varied from 1 to 15. In other words,
we are asking for the probability of there being a path from a node $n_{i,i}$ on the
diagonal of the grid to $n_{16,16}$, the lower right corner of the grid. The smaller the
value of i, the longer these paths become (and the more possible paths there are),
and hence the harder the computation. We measure the effect of the value of i
on the runtime of the query in both ProbLog2 and the first ProbLog
implementation (http://dtai.cs.kuleuven.be/problog/). For each value of i, we repeat
the experiment 10 times and average the measured runtimes.



9.2.2 Learning Setup 

In the learning experiments, we estimate the probabilities of all probabilistic facts 
from data. 

For Smokers, we again vary the domain size. For each size we generate 170 ex- 
periments. We sample 40, 50, . . . , 200 interpretations from which we retain 10, 
40, 70 and 100 percent of the atoms together with their truth value in the partial 
interpretations. From these interpretations we learn the probabilities for all in- 
tensional probabilistic facts in the program (predicates stress/1, influences/2, 
cancer_spont/l and cancer_smoke/l). 

For WebKB, the dataset consists of four disjoint sets of webpages, one per uni- 
versity. Per university, we use only the 20 words that contain the most information 
(as measured by information gain with respect to the class labels). We perform 
four-fold cross validation using both the Alchemy system (with a standard MLN 
for this application) and LFI-ProbLog. 

The ProbLog program that we use for learning is slightly different from the one
we use for inference. In addition to the rules given earlier (Section 9.1), we include
in the program two more causes for a page to have a certain class.

has_class(P,C) :- fixed_prior(P,C).
has_class(P,C) :- learnable_prior(P,C).

The predicate learnable_prior/2 accounts for the pages that are tagged with a class
that cannot be explained through words and links. There is one such probabilistic
fact for each class.

prob::learnable_prior(P,class1) :- page(P).

The predicate fixed_prior/2 makes sure that every page can be tagged with every
class.

0.001::fixed_prior(P,C) :- page(P), class(C).

Finally, for computational reasons, we modify the rule that spreads influence across
links (link_class/4) such that pages can only influence their direct neighbors.

We learn all prob-parameters in the program (not the probability of fixed_prior/2).
The learned program is too big to perform exact inference. Hence, when evaluating
the learned program (which requires running inference), we use an approximation,
namely we remove all probabilistic facts with a learned probability below 0.05.



9.3 Experimental Results 

We now discuss our results in terms of the six questions raised earlier. 
Fig. 3. Results for Smokers as a function of domain size. (When the curve for
an algorithm ends at a particular domain size, this means that the algorithm is
intractable beyond that size.)

Fig. 4. Results for WebKB as a function of domain size. (When the curve for
an algorithm ends at a particular domain size, this means that the algorithm is
intractable beyond that size.)

9.3.1 Q1 - Influence of the Grounding Algorithm

Question Q1 is: is working with the relevant rather than the complete ground
program more efficient? To answer this question, we measured the time needed for
grounding, the size of the resulting ground program, and the size of the Boolean
formula derived from this program.

The grounding step. The idea behind the relevant ground program (RGP) is to
reduce the grounding by pruning clauses that are irrelevant or inactive w.r.t. the
queries and evidence. Our setup is such that all clauses are relevant. Hence, the only
reduction comes from pruning inactive clauses (that have a false evidence literal in
the body). The effect of this pruning is small: on average the size of the ground
program is reduced by 17% (results not shown).

Implications on the conversion to a Boolean formula. The proof-based conversion
becomes intractable (i.e., takes prohibitively long) for large domain sizes, but the
size where this happens is significantly larger when working on the RGP instead of
on the complete grounding (see Fig. 3 and 4). Also the size of the Boolean formula is
reduced significantly by using the RGP (up to a 90% reduction in number of clauses
in the CNF, Fig. 3 and 4). The reason why a 17% reduction of the program can yield
a 90% reduction of the corresponding formula is that loops in the program cause
a 'blow-up' of the formula. Removing only a few rules in the ground program can
already break loops and make the formula significantly smaller. Note that the
proof-based conversion suffers from this blow-up more than the rule-based conversion
does.

Computing the grounding is always very fast, both for the RGP and the complete
grounding (milliseconds on Smokers; around 1s for WebKB). Hence, as an answer
to question Q1, we conclude that using the RGP instead of the complete grounding
is beneficial and comes at almost no computational cost. Hence, from now on we
always use the RGP.



9.3.2 Q2 - Influence of the Conversion Algorithm 

Question Q2 is: which of the two considered algorithms for converting the ground
program to a Boolean formula performs best? Recall that we have seen a rule-based
and a proof-based conversion (Section 5.2). To answer this question, we measure the
time of the conversion process, the size of the resulting formula, and how efficient
this formula is for inference.

The conversion step. The proof-based algorithm, by its nature, spends more effort on
converting the program into a compact formula. This has implications for the scalability
of the algorithm: on small domains the algorithm is fast, but on larger domains it
becomes intractable (Fig. 3 and 4). In contrast, the rule-based algorithm is able to
deal with all considered domain sizes and is always fast (runtime at most 0.5s).
A similar trend holds in terms of the size of the formula. For small domains, the
proof-based algorithm generates smaller formulas than the rule-based algorithm,
but for larger domains the opposite holds (Fig. 3 and 4).

Implications on inference. The ultimate question is how efficient the formulas
of the different conversions are for subsequent inference. We discuss this for exact
inference in the next section; here we focus on approximate inference. We use the
MC-SAT inference algorithm (Section 6.2) as a tool to evaluate how efficient the
different formulas are for inference by running MC-SAT on the two types of formulas
and measuring the quality of the estimated marginals. Evaluating the quality of
approximate marginals is non-trivial when computing true marginals is intractable.
We use the same solution as the original MC-SAT paper: we let MC-SAT run for a
fixed time (10 minutes) and measure the quality of the estimated marginals as the
likelihood of the 'query ground truth' according to these estimates; see Poon and
Domingos (2006).
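The quality measure described above can be sketched as follows. This is only an illustrative reading of the measure, not the evaluation code used in the experiments: the function name, the clamping constant `eps` and the choice to treat the query atoms independently are all assumptions made for this sketch.

```python
import math

def ground_truth_log_likelihood(marginals, truth, eps=1e-6):
    """Log-likelihood of the true query values under estimated marginals.

    marginals: dict atom -> estimated probability that the atom is true
    truth:     dict atom -> bool, the 'query ground truth'
    eps:       hypothetical clamping constant that avoids log(0)
    """
    total = 0.0
    for atom, value in truth.items():
        # Clamp the estimate so that a marginal of exactly 0 or 1 stays finite.
        p = min(max(marginals[atom], eps), 1.0 - eps)
        total += math.log(p if value else 1.0 - p)
    return total

# Hypothetical marginals for two query atoms.
estimates = {"smokes(a)": 0.9, "smokes(b)": 0.2}
truth = {"smokes(a)": True, "smokes(b)": False}
score = ground_truth_log_likelihood(estimates, truth)
```

Higher (less negative) values indicate estimates that agree better with the ground truth.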



On domain sizes where the proof-based algorithm is still tractable, inference
results are better with the proof-based formula than with the rule-based formula
(see Fig. 3, and to a smaller extent Fig. 4). This is because the proof-based
formulas are more compact and hence more samples can be drawn in the given
time (Figs. 3 and 4).

As an answer to question Q2, we conclude that for smaller domains the proof- 
based algorithm is preferable because of the smaller formulas. On larger domains, 
the rule-based algorithm should be used. 



9.3.3 Q3 - Influence of the Inference Algorithm

For exact inference, our approach consists of knowledge compilation, with either
d-DNNFs or BDDs. Question Q3 is: which of the two considered approaches,
d-DNNFs or BDDs, performs best? To answer this question, we increase the domain
size up to the point where inference (doing the compilation to d-DNNF or BDD)
becomes intractable. It is useful to distinguish between compilation of rule-based
and proof-based formulas (see footnote 19).

Proof-based formulas. On the Smokers domain, BDDs perform relatively well, but
they are nevertheless clearly outperformed by the d-DNNFs (Fig. 3). On WebKB,
the difference is even larger: BDDs are only tractable on domains of size 3 or 4,
while d-DNNFs reach up to size 18 (Fig. 4). When BDDs become intractable, this
is mostly due to memory problems (see footnote 20).

Rule-based formulas. As seen before, these formulas are less compact than the
proof-based formulas (at least for those domain sizes where exact inference is
feasible). The results clearly show that the d-DNNFs are much better at dealing with
these non-compact formulas than the BDDs are. Concretely, the d-DNNFs are still
tractable up to reasonable sizes. In contrast, using BDDs on these rule-based
formulas is nearly impossible: on Smokers the BDDs only solve sizes 3 and 4, and on
WebKB they do not solve any of the inference tasks on rule-based formulas.

19 In the PLP literature, BDDs have almost exclusively been used for proof-based
formulas (De Raedt et al. 2007; Gutmann et al. 2011). Compiling our proof-based
formulas to BDDs yields exactly the same BDDs as used by Gutmann et al. (2011).
In the special case of a single query and no evidence, this also equals the BDDs
used by De Raedt et al. (2007).

20 It might be surprising that BDDs, which are the state-of-the-art in PLP, do not
perform better. However, one should keep in mind that we are using BDDs for exact
inference here. BDDs are also used for approximate inference: one simply compiles
an approximate formula into a BDD (De Raedt et al. 2007). The same can be done
with d-DNNFs, and we again expect improvement over BDDs.

Fig. 5. Runtime of ProbLog1 and ProbLog2 on the probabilistic grid query, as a
function of the distance 16 - i between the start and end node. (When the curve for
an algorithm ends at a particular point, this means that the algorithm is intractable
beyond that point.)

As an answer to question Q3, we conclude that the use of d-DNNFs pushes the 
limit of exact MARG inference significantly further as compared to BDDs, which 
were the standard in PLP. 



9.3.4 Q4 - Computing Success Probabilities with ProbLog2

Question Q4 is: when using the 'classical' ProbLog setting of computing success
probabilities, does ProbLog2 (our new ProbLog implementation) outperform
ProbLog1 (the previous ProbLog implementation)?

For ProbLog1 we use the default parameter settings and we table the path/2
predicate. For ProbLog2 we use the proof-based conversion and we compile to
d-DNNF. These settings are motivated by our previous conclusions, which show that
the proof-based conversion works well on programs that are small enough to allow
for exact inference (as we do here) and that d-DNNFs are superior to BDDs.

As explained before (Section 9.2.1), we ask the query path(n_i_i,n_16_16),
where we vary i from 15 to 1. The smaller i, the larger the 'distance' 16 - i between
the start and end node, and hence the harder the problem. Figure 5 shows the
measured runtimes for ProbLog1 and ProbLog2.

The results show that ProbLog2 scales better than ProbLog1. ProbLog1 is tractable
up to distance 6. From distance 7 onwards, it becomes intractable, i.e., it incurs a
time-out. We set the time limit to 300 seconds and repeated every experiment
10 times. For distance 7, all 10 repetitions timed out, while for distance 6 the
average runtime was only 1.2 seconds (see footnote 21). This shows that the runtime
of ProbLog1 explodes beyond distance 6. In contrast, ProbLog2 is tractable up to
distance 10. For distance 10, all 10 repetitions finish in time, taking on average
110 seconds. For distance 11, only 5 out of 10 repetitions still finish in time. From
distance 12 onwards, all 10 repetitions time out.

9.3.5 Q5 - Ability to Learn the Original Probabilities 

Question Q5 is: when learning from data generated from a known program, can we
recover the parameters of the original program given a reasonable amount of data?
We answer this question by generating data from the given Smokers program
(Section 9.1), applying our learning algorithm to this data, and measuring the
difference between the learned probabilities and those in the original program. We
measure this difference in two ways. First, we use the mean absolute error (MAE)
between both sets of probabilities. Second, we use the Kullback-Leibler (K-L)
divergence, a measure of dissimilarity between a 'true' probability distribution (the
one of the original program) and an 'approximating' distribution (the one of the
learned program). ProbLog allows for an efficient calculation of the K-L-divergence
because of the independence of the probabilistic facts; see Appendix D.

Both the MAE (Figure 6) and the K-L-divergence (Figure 7) show that LFI-ProbLog
can learn the original probabilities: both MAE and K-L-divergence approach zero
when more examples are given. The 100% knowledge line shows the
optimal way of calculating the probabilities, given the interpretations. The
remaining cases, 10%, 40% and 70%, show that the quality of the approximations,
as expected, drops when more atoms become unobserved. However, the
approximations remain of good quality. Hence we can conclude that LFI-ProbLog is
capable of recovering the original probabilities and is robust against missing
values. When we compare the figures for the different domain sizes, we see that the
results are independent of the number of persons in the domain.
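For concreteness, the MAE between the original and the learned fact probabilities can be computed as in the sketch below. The probability values are hypothetical and serve only to illustrate the measure; they are not taken from the experiments.

```python
def mean_absolute_error(original, learned):
    """Mean absolute difference between two aligned lists of probabilities."""
    if not original or len(original) != len(learned):
        raise ValueError("expected two non-empty lists of equal length")
    return sum(abs(p - q) for p, q in zip(original, learned)) / len(original)

# Hypothetical fact probabilities of an original and a learned program.
original_probs = [0.2, 0.3, 0.4]
learned_probs = [0.25, 0.3, 0.3]
mae = mean_absolute_error(original_probs, learned_probs)
```

A value of zero means that every learned probability coincides with the corresponding original one.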



9.3.6 Q6 - Learning from Real-World Data

Question Q6 is: when learning from real-world data, can we obtain results
comparable to the ones obtained with a state-of-the-art system? To answer this
question, we compare LFI-ProbLog with the Alchemy system for Markov Logic, running
four-fold cross validation on the WebKB dataset. Table 3 shows the negative
log-likelihood obtained with LFI-ProbLog and Alchemy on the test sets of the four
folds. In the case of Alchemy, we report two results: 'Alchemy' stands for using the
system with its default parameters, 'Alchemy*' stands for Alchemy with a modified
setting that puts a very strong prior on the weights (prior around zero, with
standard deviation 0.1 instead of the default 100; see footnote 22).

21 To verify that the measurement for distance 7 is not a glitch, we also tried distance 8 and
further, but ProbLog1 consistently timed out for all of these.

22 This modified setting was recommended to us by the Alchemy developers (personal
communication with Daniel Lowd).






Fig. 6. Mean absolute error (MAE, lower is better) when learning from Smokers
data with 10%, 40%, 70% and 100% knowledge of the possible world. Panels:
(a) 4 persons, (b) 5 persons, (c) 6 persons, (d) 7 persons; x-axis: number of
interpretations.

Fig. 7. K-L-divergence (lower is better) when learning from Smokers data with
10%, 40%, 70% and 100% knowledge of the possible world.

Test Set      LFI-ProbLog   Alchemy*    Alchemy
Cornell           1309.72     613.37    1603.31
Texas             1210.51     640.56    1075.75
Washington         646.39     622.55    1420.87
Wisconsin         1033.90     783.51    3479.04

Table 3. Negative log-likelihood (lower is better) on the WebKB learning
experiment.

LFI-ProbLog outperforms Alchemy with its default settings on three of the 
four folds. However, Alchemy with the strong prior (Alchemy*) outperforms LFI- 
ProbLog on all four folds. We conclude that LFI-ProbLog is competitive with 
Alchemy, but parameter tuning can have a large impact. These results illustrate 
the importance of setting a suitable prior when learning. This is a topic that we 
have not yet explored in detail for LFI-ProbLog but that we plan to study in future 
research. 



10 Conclusion 

The contributions of this paper are threefold. 

First, we have introduced a two-step procedure for MPE and MARG inference in 
general probabilistic logic programs. In a first step it generates a weighted Boolean 
formula that captures all relevant information about a specific query, evidence and 
probabilistic logic program. This step relies on well-known conversion techniques 
from logic programming. The second step then invokes well-known solvers (for in- 
stance for WMC and weighted MAX-SAT) on the generated weighted formula. 



The resulting inference procedure is akin to that employed by Darwiche (2009)
and others (Park 2002; Sang et al. 2005) for probabilistic graphical models (where
many inference problems are also cast in terms of weighted Boolean formulas) but
adapted to the much more expressive class of probabilistic logic programs. Our
conversion-based approach is advantageous because it allows us to employ a wide
range of well-known and optimized solvers on the weighted formula, essentially
giving us "inference algorithms for free". Furthermore, the approach also improves
upon the state-of-the-art in probabilistic logic programming, where one has typi- 
cally focussed on inference with a single query atom and no evidence (cf. Sectionffl), 
often by using BDDs. By using d-DNNFs instead of BDDs, we obtained speed-ups 
that push the limit of exact MARG inference significantly further. 

Second, we have developed an Expectation-Maximization approach to learning 
probabilistic logic programs from interpretations. This approach employs our novel 
inference procedures in the expectation step. The learning from interpretation set- 
ting is akin to that used in the graphical model and Statistical Relational Learning 
(SRL) communities. 

Third, the two approaches have been incorporated in a novel implementation
of the PLP language ProbLog, which, unlike its previous implementation in
YAP-Prolog (Kimmig et al. 2010), is closer to that of answer set programming
systems than to Prolog systems.

Overall our approach provides new insights into the relationships between PLP, 
graphical models and SRL. As one immediate outcome, we pointed out a conversion 
of probabilistic logic programs to ground Markov Logic, which allowed us to apply 
MC-SAT to PLP inference. This contributes to further bridging the gap between 
PLP and the field of SRL. 



Appendix A Proofs 

In this appendix we give the proofs of Theorems 1 to 3.

Proof of Theorem 1

To prove Theorem 1 we first give some necessary lemmas. We use
$\mathit{pruneInactive}(L, E = e)$ to denote the result of removing from a ground
program $L$ all rules that are inactive under the evidence $E = e$.
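The pruning operation can be illustrated with a minimal sketch. The representation below (a `Rule` tuple with signed body literals, and the names `is_inactive` and `prune_inactive`) is hypothetical and chosen only for this example; a rule is treated as inactive when some body literal is false under the evidence, as described in the experiments.

```python
from typing import Dict, List, NamedTuple, Tuple

class Rule(NamedTuple):
    head: str
    # Body literals as (atom, sign); sign False denotes a negated atom.
    body: Tuple[Tuple[str, bool], ...]

def is_inactive(rule: Rule, evidence: Dict[str, bool]) -> bool:
    """A rule is inactive if some body literal is false under the evidence."""
    return any(atom in evidence and evidence[atom] != sign
               for atom, sign in rule.body)

def prune_inactive(rules: List[Rule], evidence: Dict[str, bool]) -> List[Rule]:
    """Return the rules of the ground program that are not inactive."""
    return [r for r in rules if not is_inactive(r, evidence)]

# A small ground program in the style of the Alarm example (assumed here).
RULES = [
    Rule("alarm", (("burglary", True),)),
    Rule("alarm", (("earthquake", True),)),
    Rule("calls(john)", (("alarm", True), ("hears_alarm(john)", True))),
]
pruned = prune_inactive(RULES, {"burglary": False})
```

With the evidence burglary = false, only the first rule is removed; atoms without evidence never make a rule inactive.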

Lemma 2

Let $L$ be a ground normal logic program and let
$L' = \mathit{pruneInactive}(L, E = e)$. For each world/interpretation $\omega$
that is consistent with the evidence $E = e$ it holds:

a) for each subset $A$ of atoms: $A$ is an unfounded set with respect to $\omega$
under program $L$ if and only if it is so under program $L'$, and

b) $\omega$ is the well-founded model of $L$ if and only if it is the well-founded
model of $L'$.

Proof:

Part a: We use the notion of unfounded set; see Definition 3.1 in Van Gelder et al.
(1991). We prove both directions of the 'if and only if'.

• If $A$ is an unfounded set with respect to $\omega$ under program $L$, then this
  also holds under program $L'$:

  The definition of unfounded set imposes a certain condition on each rule in
  the program whose head is in the set $A$; we refer to this as the unfounded rule
  condition. If we know that this condition holds for all such rules in $L$, then it
  also holds for all such rules in $L'$, because the latter set of rules is a subset
  of the former ($L'$ is the result of removing inactive rules from $L$).

• If $A$ is an unfounded set with respect to $\omega$ under program $L'$, then this
  also holds under program $L$:

  The 'if' part of this 'if-then' implies that the unfounded rule condition holds
  for all rules in $L'$, so to prove the 'then' part we only need to show that the
  unfounded rule condition also holds for all rules in $L \setminus L'$ (i.e., for
  all rules in $L$ that were removed because of being inactive under the evidence).
  Every rule $r \in L \setminus L'$ contains at least one literal in its body that
  is false according to the evidence (that is what made $r$ inactive). Since this
  lemma applies only to worlds $\omega$ that are consistent with the evidence, we
  have for every such world $\omega$: every rule $r \in L \setminus L'$ contains in
  its body at least one literal that is false in $\omega$. This is a sufficient
  condition to make the rule $r$ an unfounded rule; see condition '1.' in
  Definition 3.1 in Van Gelder et al. (1991).



Part b: We can now use Part a to prove that, for every evidence-consistent world
$\omega$, $\omega$ is the well-founded model (WFM) of $L$ if and only if it is the
WFM of $L'$.

The WFM is the fixed point of the $W_P$ operator (Van Gelder et al. 1991). For
a program $L$, this operator is defined as
$W_L(\omega) = T_L(\omega) \cup \neg U_L(\omega)$; see Definition 3.3 in
Van Gelder et al. (1991). We prove below that, for every evidence-consistent
world $\omega$: $T_L(\omega) = T_{L'}(\omega)$ and $U_L(\omega) = U_{L'}(\omega)$.
Hence $W_L(\omega) = W_{L'}(\omega)$, hence their fixed points are identical, hence
Part b holds.

• For every evidence-consistent world $\omega$: $T_L(\omega) = T_{L'}(\omega)$:

  $L$ consists of all rules in $L'$ plus some rules that are inactive under the
  evidence. For each evidence-consistent $\omega$, the bodies of the inactive rules
  in $L$ are false under $\omega$ and hence these rules cannot 'fire'. Hence these
  rules play no role in the execution of the $T_P$ operator on $\omega$. Hence
  $T_L(\omega) = T_{L'}(\omega)$.

• For every evidence-consistent world $\omega$: $U_L(\omega) = U_{L'}(\omega)$:

  $U_L(\omega)$ is the greatest unfounded set with respect to (wrt) $\omega$, which
  is defined as the union of all unfounded sets wrt $\omega$; see Definition 3.2 in
  Van Gelder et al. (1991). Part a says that any subset $A$ of atoms is an
  unfounded set wrt $\omega$ under program $L$ if and only if it is so under
  program $L'$. Hence $U_L(\omega) = U_{L'}(\omega)$.

□



Lemma 3

Let $L$ be a ground ProbLog program and let
$L' = \mathit{pruneInactive}(L, E = e)$. Then
$MOD_{E=e}(L) = MOD_{E=e}(L')$.

Proof: $MOD_{E=e}(L)$ is defined as the set of all worlds $\omega$ that are
consistent with the evidence $E = e$ and are models of the ProbLog program $L$,
i.e., for which there exists a total choice $C$ with $WFM(C \cup R) = \omega$, with
$R$ the rules in $L$ and $WFM()$ the well-founded model. The previous lemma implies
that, for every $\omega$ consistent with the evidence, removing inactive rules from
a given logic program does not alter whether or not $\omega$ is the WFM of that
program. In other words: $\omega \in MOD_{E=e}(L)$ if and only if
$\omega \in MOD_{E=e}(L')$. Hence $MOD_{E=e}(L) = MOD_{E=e}(L')$. □

Lemma 4

Let $L$ be a ground ProbLog program and let
$L' = \mathit{pruneInactive}(L, E = e)$. Then
$P_L(Q \mid E = e) = P_{L'}(Q \mid E = e)$.

Proof: We prove the stronger condition
$\forall \omega: P_L(\omega \mid E = e) = P_{L'}(\omega \mid E = e)$. The
conditional probability $P_L(\omega \mid E = e)$ of an interpretation $\omega$
according to program $L$ is:

• (case 1) if $\omega \in MOD_{E=e}(L)$ then

  $$P_L(\omega \mid E = e) = \frac{P_L(\omega, E = e)}{P_L(E = e)}.$$

  Since $\omega$ agrees with $E = e$, we have that the combined assignment
  $\omega, E = e$ is simply equal to $\omega$. Hence:

  $$P_L(\omega \mid E = e) = \frac{P_L(\omega)}{\sum_{\omega' \in MOD_{E=e}(L)} P_L(\omega')}. \tag{A1}$$

• (case 2) if $\omega \notin MOD_{E=e}(L)$ then $P_L(\omega \mid E = e) = 0$.

We now prove that for every $\omega$,
$P_L(\omega \mid E = e) = P_{L'}(\omega \mid E = e)$. The proof consists of two
parts.

1. We need to prove that we are in case 1 under $L$ if and only if we are in
   case 1 under $L'$. In other words: for every $\omega$:
   $\omega \in MOD_{E=e}(L)$ if and only if $\omega \in MOD_{E=e}(L')$. This
   follows from the previous lemma.

2. We need to prove that if we are in case 1 (i.e., if
   $\omega \in MOD_{E=e}(L)$), then the conditional probability given by the
   fraction in Equation A1 is the same under $L$ as under $L'$.

   • The numerator is the same under $L$ and $L'$. This can be seen as follows.
     For any $\omega \in MOD_{E=e}(L)$, the probability $P(\omega)$ is by
     definition equal to the probability of $\omega$'s total choice. The ProbLog
     programs $L$ and $L'$ differ in their rules, but they have exactly the same
     probabilistic facts and hence determine the same probability distribution
     over total choices. Hence $P(\omega)$ is the same under $L$ as under $L'$.

   • The sum in the denominator is also the same under $L$ and $L'$. This can
     be seen as follows. First, the set $MOD_{E=e}(L)$ over which the sum ranges
     is the same under $L$ as under $L'$ because of the above lemma. Second,
     each term in the sum is the same under $L$ as under $L'$, i.e., for every
     $\omega \in MOD_{E=e}(L)$ the probability $P(\omega)$ is the same under $L$
     as under $L'$ (because of the same reasoning as for the numerator).

This concludes the proof. □

Theorem 1

Let $L$ be a ProbLog program and let $L_g$ be the relevant ground program for $L$
with respect to $Q$ and $E = e$. Then
$P_L(Q \mid E = e) = P_{L_g}(Q \mid E = e)$.

Proof: It follows from the grounding semantics of ProbLog that replacing the
original program $L$ by its full grounding (w.r.t. the Herbrand base)
$L_{full}$ preserves the distribution, i.e.,
$P_L(Q \mid E = e) = P_{L_{full}}(Q \mid E = e)$. The relevant ground program
$L_g$ differs from $L_{full}$ only in that it does not contain inactive rules (with
respect to $E = e$) or irrelevant rules (with respect to $Q \cup E$). The lemma
above states that removing inactive rules preserves the distribution
$P(Q \mid E = e)$. Removing irrelevant rules also preserves this distribution; this
can be seen as follows. The probability of an atom being true can be determined
from all proofs of the atom and the probabilities of the probabilistic facts
appearing in these proofs; see De Raedt et al. (2007). Irrelevant rules are, by
definition, rules that are not used in any proof of any atom in $Q \cup E$. Hence
omitting such irrelevant rules does not alter the distribution $P(Q, E)$.
Hence, also the distribution $P(Q \mid E = e)$ is preserved because
$P(Q \mid E = e)$ can be defined in terms of $P(Q, E)$, i.e.,

$$P(Q \mid E = e) = \frac{P(Q, E = e)}{P(E = e)} = \frac{P(Q, E = e)}{\sum_{q} P(Q = q, E = e)}.$$

□

Proof of Theorem 2

Theorem 2

Let $L_g$ be the relevant ground program for some ProbLog program with respect to
$Q$ and $E = e$. Let $MOD_{E=e}(L_g)$ be those models in $MOD(L_g)$ that are
consistent with the evidence $E = e$. Let $\varphi$ denote the formula and
$w(\cdot)$ the weight function of the weighted formula derived from $L_g$. Then:

- (model equivalence) $SAT(\varphi) = MOD_{E=e}(L_g)$,

- (weight equivalence) $\forall \omega \in SAT(\varphi)$:
  $w(\omega) = P_{L_g}(\omega)$, i.e., the weight of $\omega$ according to
  $w(\cdot)$ is equal to the probability of $\omega$ according to $L_g$.

Proof: The proof consists of two parts.

Model equivalence. Consider Lemma 1 (Section 5.2). The lemma is about the
formula $\varphi_r$ that captures the rules but not yet the evidence. The lemma
states that $SAT(\varphi_r) = MOD(L_g)$. The present theorem is about the formula
$\varphi = \varphi_r \wedge \varphi_e$, where $\varphi_e$ captures the evidence.
The effect of adding $\varphi_e$ to the formula is that all worlds not consistent
with the evidence are ruled out. Hence
$SAT(\varphi_r \wedge \varphi_e) = MOD_{E=e}(L_g)$.

Weight equivalence. Weight equivalence says that the probability of every model
(according to $L_g$) is equal to the weight of the model (according to our weight
function $w(\cdot)$). This follows from the way the probability and the weight
function are defined.

• The probability of a model of a ProbLog program, according to the distribution
  semantics, is the probability of the underlying total choice, which in
  turn is defined as the product of probabilities of each of the atomic choices.
  Formally the probability of a model $\omega$ is:

  $$P(\omega) = \prod_{a \in PA^+(\omega)} p(a) \prod_{a \in PA^-(\omega)} p(\neg a) = \prod_{a \in PA^+(\omega)} p(a) \prod_{a \in PA^-(\omega)} (1 - p(a)),$$

  with $PA^+(\omega)$ (respectively $PA^-(\omega)$) being the set of all ground
  probabilistic atoms that are true (resp. false) in $\omega$ and $p(\cdot)$
  denoting the probability distribution specified by the probabilistic facts.

• The weight of a world $\omega$ according to our weight function is the product of
  the weights of all literals $l$ constituting the world/interpretation $\omega$:

  $$w(\omega) = \prod_{l \in \omega} w(l).$$

  The literals/atoms in $\omega$ fall into four groups: probabilistic atoms that are
  true in $\omega$ (denoted $PA^+(\omega)$), non-probabilistic or derived atoms that
  are true in $\omega$ (denoted $DA^+(\omega)$), and similar for the atoms that are
  false in $\omega$ ($PA^-(\omega)$ and $DA^-(\omega)$). Hence:

  $$w(\omega) = \prod_{l \in \omega} w(l) = \prod_{a \in PA^+(\omega)} w(a) \prod_{a \in PA^-(\omega)} w(\neg a) \prod_{a \in DA^+(\omega)} w(a) \prod_{a \in DA^-(\omega)} w(\neg a).$$

  By definition of the weight function, the weight $w(a)$ for an atom
  $a \in PA^+(\omega)$ is $p(a)$, the weight $w(\neg a)$ for an atom
  $a \in PA^-(\omega)$ is $1 - p(a)$, and the weight of the literal for an atom
  $a \in DA^+(\omega) \cup DA^-(\omega)$ is 1. Hence:

  $$w(\omega) = \prod_{a \in PA^+(\omega)} p(a) \prod_{a \in PA^-(\omega)} (1 - p(a)) \prod_{a \in DA^+(\omega)} 1 \prod_{a \in DA^-(\omega)} 1 = \prod_{a \in PA^+(\omega)} p(a) \prod_{a \in PA^-(\omega)} (1 - p(a)) = P(\omega).$$

This proves weight equivalence. □
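Weight equivalence can also be checked numerically on a small program. The sketch below assumes a three-fact version of the Alarm example (burglary 0.1, earthquake 0.2, hears_alarm(john) 0.7, with alarm and calls(john) as derived atoms); the helper names are hypothetical, and models are enumerated by brute force rather than by any of the solvers discussed in the paper.

```python
import math
from itertools import product

# Assumed probabilistic facts of a restricted Alarm program.
FACTS = {"burglary": 0.1, "earthquake": 0.2, "hears_alarm(john)": 0.7}

def model_of(choice):
    """Extend a total choice with the derived atoms alarm and calls(john)."""
    world = dict(choice)
    world["alarm"] = choice["burglary"] or choice["earthquake"]
    world["calls(john)"] = world["alarm"] and choice["hears_alarm(john)"]
    return world

def literal_weight(atom, value):
    """w(a) = p(a), w(not a) = 1 - p(a) for facts; weight 1 for derived atoms."""
    if atom in FACTS:
        return FACTS[atom] if value else 1.0 - FACTS[atom]
    return 1.0

def world_weight(world):
    return math.prod(literal_weight(a, v) for a, v in world.items())

def total_choice_probability(choice):
    return math.prod(FACTS[a] if v else 1.0 - FACTS[a] for a, v in choice.items())

def models():
    atoms = list(FACTS)
    for values in product([True, False], repeat=len(atoms)):
        yield model_of(dict(zip(atoms, values)))

def weighted_model_count(evidence):
    """Sum the weights of all models that agree with the evidence."""
    return sum(world_weight(w) for w in models()
               if all(w[a] == v for a, v in evidence.items()))

p_evidence = weighted_model_count({"earthquake": True, "calls(john)": True})
```

Under these assumed parameters the weighted model count for the evidence earthquake = true and calls(john) = true is 0.2 · 0.7 = 0.14, and the count without evidence is 1.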

Proof of Theorem 3 

Theorem 3

Let $L_g$ be the relevant ground program for some ProbLog program with respect to
$Q$ and $E = e$. Let $M$ be the corresponding ground MLN. The distribution $P(Q)$
according to $M$ is the same as the distribution $P(Q \mid E = e)$ according to
$L_g$.

Proof: We prove that (1) the set of worlds with non-zero probability according to
the MLN is the same as the set of worlds with non-zero probability according to the
ProbLog program and the evidence; (2) for every such world $\omega$,
$P_M(\omega) = P_{L_g}(\omega \mid E = e)$.

(Part 1) A world has non-zero probability according to an MLN if it satisfies all
hard clauses in the MLN. The hard clauses in the MLN are the same as the clauses in
the weighted formula $\varphi$. Hence the set of worlds with non-zero probability
according to the MLN equals $SAT(\varphi)$. Theorem 2 (model equivalence) implies
that this set equals $MOD_{E=e}(L_g)$, which is exactly the set of worlds with
non-zero probability according to the ProbLog program and the evidence.

(Part 2) The probability of a world $\omega \in SAT(\varphi)$ according to an MLN
is defined as $P_M(\omega) = W(\omega)/Z$, with $W(\omega)$ the product of
exponentiated weights of the soft clauses satisfied in $\omega$, and $Z$ the
normalization constant. The probability of $\omega$ according to the ProbLog
program conditioned on the evidence is
$P_{L_g}(\omega \mid E = e) = P_{L_g}(\omega)/P_{L_g}(E = e)$. We now show that
both expressions are the same (i.e.,
$W(\omega)/Z = P_{L_g}(\omega)/P_{L_g}(E = e)$).

• The numerators are the same ($W(\omega) = P_{L_g}(\omega)$): The only soft
  clauses in the MLN are unit clauses, whose weights are derived from the
  probabilistic facts. The unit clauses are such that, for any given world
  $\omega$, there is one unit clause per probabilistic atom that is satisfied.
  $W(\omega)$ is the product of the exponentiated weights of all these clauses. It
  follows from the way these weights are defined in terms of the weighted formula,
  and from weight equivalence between the weighted formula and the ProbLog program
  (Theorem 2), that this product is equal to the probability of the total choice
  of $\omega$ according to the ProbLog program and hence to the numerator
  $P_{L_g}(\omega)$.

• The denominators are the same ($Z = P_{L_g}(E = e)$): The normalization
  constant $Z$ of the MLN is defined as
  $\sum_{\omega \in SAT(\varphi)} W(\omega)$. The evidence probability
  $P_{L_g}(E = e)$ equals $\sum_{\omega \in MOD_{E=e}(L_g)} P(\omega)$. These sums
  are equal since (a) the sets over which they range are equal due to Theorem 2
  (model equivalence), and (b) the summed terms are equal (because of the same
  reasoning as for the numerator).

This concludes the proof. □

Fig. C1. A d-DNNF and the corresponding smooth d-DNNF for the formula
burglary ∨ earthquake. Panels: (a) a non-smooth d-DNNF, (b) a smooth d-DNNF.

Appendix B Markov Logic 



We briefly review Markov Logic (Domingos et al. 2008). While Markov Logic
generally works with FOL formulas, we consider only the ground case, as this is
sufficient for our paper.

A Markov Logic Network (MLN) consists of two parts: a set of 'soft' formulas $f_i$,
which each have an associated weight $w_i \in \mathbb{R}$, and a set of 'hard'
formulas. An MLN determines a probability distribution on the set of possible worlds
(determined by the Herbrand base). The probability of a world $\omega$ is 0 if it
violates some hard formula and is $\frac{1}{Z} e^{\sum_i w_i \delta_i(\omega)}$
otherwise, where $Z$ is the normalization constant, the sum is over all soft
formulas and $\delta_i(\omega)$ is the indicator function being 1 if the soft
formula $f_i$ is true in world $\omega$ and 0 otherwise. Note that the exponent
$\sum_i w_i \delta_i(\omega)$ is the sum of weights of satisfied soft formulas in
world $\omega$; the higher this sum, the more likely $\omega$ is. The name 'MLN'
comes from the fact that this probability distribution can also be written as the
distribution of a Markov network.
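The definition can be made concrete with a brute-force sketch over all worlds of a tiny ground MLN. Everything in the example is an assumption made for illustration: formulas are represented as Python predicates over a world, and the two soft unit clauses use log-probabilities as weights, which is one weighting consistent with the description in the proof of Theorem 3 but is not claimed to be the exact encoding used in the paper.

```python
import math
from itertools import product

def mln_distribution(atoms, soft, hard):
    """Distribution of a ground MLN, computed by enumerating all worlds.

    atoms: list of atom names
    soft:  list of (weight, formula) pairs; a formula maps a world to bool
    hard:  list of formulas that every world with non-zero probability satisfies
    Returns a dict from truth-value tuples (ordered as `atoms`) to probabilities.
    """
    scores = {}
    for values in product([False, True], repeat=len(atoms)):
        world = dict(zip(atoms, values))
        if all(h(world) for h in hard):
            # exp of the summed weights of the satisfied soft formulas
            scores[values] = math.exp(sum(w for w, f in soft if f(world)))
        else:
            scores[values] = 0.0
    z = sum(scores.values())
    if z == 0.0:
        raise ValueError("the hard formulas are unsatisfiable")
    return {values: s / z for values, s in scores.items()}

# Hypothetical ground MLN: alarm is equivalent to burglary (hard), and two
# soft unit clauses carry the log-probabilities of a fact 0.1::burglary.
atoms = ["burglary", "alarm"]
hard = [lambda w: w["alarm"] == w["burglary"]]
soft = [(math.log(0.1), lambda w: w["burglary"]),
        (math.log(0.9), lambda w: not w["burglary"])]
dist = mln_distribution(atoms, soft, hard)
```

In this toy network the world in which both atoms are true receives probability 0.1, and every world violating the hard formula receives probability 0.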

Appendix C The need for smoothing d-DNNFs 

The algorithm we use to compute marginal probabilities requires a smooth d-DNNF.
A smooth d-DNNF is a d-DNNF where for every disjunction node all children use
exactly the same set of atoms. That is, if $C_1, \ldots, C_n$ are the children of
an OR node $C$, then $Atoms(C_i) = Atoms(C_j)$ for $i \neq j$, where $Atoms(C_i)$
is the set of atoms which $C_i$ uses.

Figure C1a shows a non-smooth d-DNNF, while Figure C1b shows the corresponding
smooth d-DNNF (which we have already shown earlier in the paper but repeat here
for convenience). Consider the non-smooth d-DNNF. The OR node
does not satisfy smoothness, since the sets of atoms of its children differ
({burglary, earthquake} and {burglary}; the negation is ignored here). Hence we
need to transform this d-DNNF into a smooth d-DNNF; see Figure C1b. This is done by
substituting the burglary node by an AND node, then adding the burglary node as
a child to the new AND node and creating a new smoothing node for the missing
atom earthquake. The smoothing node is an OR node which links to earthquake
and ¬earthquake. It is then linked to the AND node.

Let us illustrate how smoothness affects the computation of probabilities using
our Alarm running example (so not the restricted version of the example considered
above). We have seen the arithmetic circuit (AC) corresponding to the smooth
d-DNNF for this example before; recall Figure 2 on p. 22. This figure also
illustrates how we can compute the probability of the conjunction
$P(\mathit{earthquake} = \mathit{true} \wedge \mathit{calls(john)} = \mathit{true})$.
This yields the value 0.14, which is indeed the correct value.
In contrast, Figure C2 shows the same evaluation process on an AC for the
non-smooth d-DNNF. This results in an incorrect value (0.196). This shows the need
for smoothness of the d-DNNF.
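The two values can be reproduced with a small hand-written evaluation. The sketch below is a simplification made for illustration: it hard-codes only the branch of the circuit in which calls(john) is true (the other branches evaluate to zero under the evidence), and it assumes the parameters 0.1 for burglary, 0.2 for earthquake and 0.7 for hears_alarm(john), which match the leaf weights shown in Figure C2. The function names are hypothetical.

```python
import math

def indicators(evidence):
    """Return lam(atom, positive): 0 if the literal contradicts the evidence, else 1."""
    def lam(atom, positive):
        return 0.0 if atom in evidence and evidence[atom] != positive else 1.0
    return lam

def non_smooth_ac(evidence):
    """AC for the non-smooth d-DNNF: the burglary branch never mentions earthquake."""
    lam = indicators(evidence)
    burglary = lam("burglary", True) * 0.1
    not_burglary = lam("burglary", False) * 0.9
    earthquake = lam("earthquake", True) * 0.2
    alarm = lam("alarm", True) * 1.0
    hears = lam("hears_alarm(john)", True) * 0.7
    calls = lam("calls(john)", True) * 1.0
    return calls * hears * alarm * (burglary + not_burglary * earthquake)

def smooth_ac(evidence):
    """AC for the smooth d-DNNF: a smoothing node adds earthquake to the burglary branch."""
    lam = indicators(evidence)
    burglary = lam("burglary", True) * 0.1
    not_burglary = lam("burglary", False) * 0.9
    earthquake = lam("earthquake", True) * 0.2
    not_earthquake = lam("earthquake", False) * 0.8
    smoothing = earthquake + not_earthquake  # OR node over earthquake / not earthquake
    alarm = lam("alarm", True) * 1.0
    hears = lam("hears_alarm(john)", True) * 0.7
    calls = lam("calls(john)", True) * 1.0
    return calls * hears * alarm * (burglary * smoothing + not_burglary * earthquake)

evidence = {"earthquake": True, "calls(john)": True}
correct = smooth_ac(evidence)        # approximately 0.14
incorrect = non_smooth_ac(evidence)  # approximately 0.196
```

In the non-smooth circuit the evidence earthquake = true has no leaf to act on in the burglary branch, so that branch contributes 0.1 instead of 0.1 · 0.2, which explains the overestimate.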




Fig. C2. The arithmetic circuit corresponding to the non-smooth d-DNNF for the
Alarm example. Leaves shown (indicator value, weight): λ[calls(john)] = 1, 1.0;
λ[hears_alarm(john)] = 1, 0.7; λ[alarm] = 1, 1.0; λ[¬burglary] = 1, 0.9;
λ[earthquake] = 1, 0.2.



Appendix D Kullback-Leibler Divergence Between ProbLog Programs 

The Kullback-Leibler divergence $D(P \| Q)$ is a non-symmetric measure for the
difference of two probability distributions $P$ and $Q$ (cf. Wasserman 2003). It is
used in probability theory as well as in information theory, where it is also known
as information gain. The K-L divergence aggregates the difference of the two
distributions on all elements of the outcome space. It is only defined if the
support of $Q$ contains the support of $P$, that is, for all $i$ where $P(i) > 0$
also $Q(i) > 0$.

We use the K-L divergence to evaluate the LFI-ProbLog learning algorithm
(cf. Algorithm 1) and measure how close the learned program $T_2$ is to the ground
truth program $T_1$. We are doing parameter estimation, that is, the structure of
the program is fixed and only the fact probabilities change. Hence we can restrict
the definition of the K-L divergence to programs that are identical except for the
fact probabilities.

Definition 1 (K-L Divergence)

Let $T_1 = F_1 \cup R$ and $T_2 = F_2 \cup R$ be ground ProbLog programs such that
the probabilistic facts are identical except for the probabilities, that is,
$F_1 = \{p_i :: f_i \mid 1 \le i \le n\}$ and
$F_2 = \{q_i :: f_i \mid 1 \le i \le n\}$. Let $At$ denote the Herbrand base of
$T_1$ and $T_2$ (note that they have the same Herbrand base). We denote
interpretations as subsets of atoms, i.e., $L \subseteq At$ is the interpretation in
which the atoms that are in $L$ are true and the other atoms are false. Then the
K-L Divergence between $T_1$ and $T_2$ is defined as

$$D(T_1 \| T_2) = \sum_{L \subseteq At} P_{T_1}(L) \log \frac{P_{T_1}(L)}{P_{T_2}(L)}. \tag{D1}$$

There are exponentially many interpretations L C At, which makes evaluating the 
K-L divergence as defined above impossible in practice. However, the probabilistic 
facts in a ProbLog program are independent, which can be exploited to compute 
the K-L divergence in linear time by looping once over F. 

Theorem 4

Let $T_1 = F_1 \cup R$ and $T_2 = F_2 \cup R$ be ground ProbLog programs such that
the probabilistic facts are identical except for the probabilities, that is,
$F_1 = \{p_i :: f_i \mid 1 \le i \le n\}$ and
$F_2 = \{q_i :: f_i \mid 1 \le i \le n\}$. Then the K-L Divergence between $T_1$
and $T_2$ can be calculated as

$$D(T_1 \| T_2) = \sum_{i=1}^{n} \left( p_i \log \frac{p_i}{q_i} + (1 - p_i) \log \frac{1 - p_i}{1 - q_i} \right). \tag{D2}$$

It is possible to extend the K-L divergence and the theorem to non-ground facts.
To do so, one needs to multiply each summand
$p_i \log \frac{p_i}{q_i} + (1 - p_i) \log \frac{1 - p_i}{1 - q_i}$ by the
number of ground instances of the probabilistic fact $f_i$.
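As a numerical sanity check, the closed form (D2) can be compared with a brute-force evaluation of (D1). The sketch below assumes ground facts with probabilities strictly between 0 and 1, and it enumerates total choices rather than all interpretations: since both programs share the rules $R$, every total choice determines the same unique model in both, and all other interpretations have probability zero and contribute nothing to the sum. The function names and the example probabilities are hypothetical.

```python
import math
from itertools import product

def kl_closed_form(p, q):
    """K-L divergence via (D2): one pass over the fact probabilities."""
    return sum(pi * math.log(pi / qi)
               + (1.0 - pi) * math.log((1.0 - pi) / (1.0 - qi))
               for pi, qi in zip(p, q))

def kl_brute_force(p, q):
    """K-L divergence via (D1), summing over all 2^n total choices."""
    total = 0.0
    for choice in product([True, False], repeat=len(p)):
        p_l = math.prod(pi if c else 1.0 - pi for pi, c in zip(p, choice))
        q_l = math.prod(qi if c else 1.0 - qi for qi, c in zip(q, choice))
        total += p_l * math.log(p_l / q_l)
    return total

# Hypothetical probabilities of an original and a learned program.
p = [0.1, 0.2, 0.7]
q = [0.15, 0.25, 0.6]
fast = kl_closed_form(p, q)
slow = kl_brute_force(p, q)
```

Both functions agree up to floating-point error, but the closed form needs time linear in the number of facts, whereas the brute-force sum is exponential.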

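Proof

We prove Theorem 4 by induction over the number of probabilistic facts.

Base case $n = 1$.

$$
\begin{aligned}
D(T_1 \| T_2) &= \sum_{L \subseteq At} P_{T_1}(L) \log \frac{P_{T_1}(L)}{P_{T_2}(L)} \\
&= P_{T_1}(\{f_1\}) \log \frac{P_{T_1}(\{f_1\})}{P_{T_2}(\{f_1\})} + P_{T_1}(\emptyset) \log \frac{P_{T_1}(\emptyset)}{P_{T_2}(\emptyset)} \\
&= p_1 \log \frac{p_1}{q_1} + (1 - p_1) \log \frac{1 - p_1}{1 - q_1} \\
&= \sum_{i=1}^{1} \left( p_i \log \frac{p_i}{q_i} + (1 - p_i) \log \frac{1 - p_i}{1 - q_i} \right)
\end{aligned}
$$

Inductive case $n \to n + 1$. To simplify the notation, we define
$T_1^{n+1} = T_1 \cup \{p_{n+1} :: f_{n+1}\}$ and
$T_2^{n+1} = T_2 \cup \{q_{n+1} :: f_{n+1}\}$.

$$
\begin{aligned}
D(T_1^{n+1} \| T_2^{n+1}) &= \sum_{L \subseteq (At \cup \{f_{n+1}\})} P_{T_1^{n+1}}(L) \log \frac{P_{T_1^{n+1}}(L)}{P_{T_2^{n+1}}(L)} \\
&= \sum_{L \subseteq At} P_{T_1^{n+1}}(L \cup \{f_{n+1}\}) \log \frac{P_{T_1^{n+1}}(L \cup \{f_{n+1}\})}{P_{T_2^{n+1}}(L \cup \{f_{n+1}\})}
 + \sum_{L \subseteq At} P_{T_1^{n+1}}(L) \log \frac{P_{T_1^{n+1}}(L)}{P_{T_2^{n+1}}(L)}
\end{aligned}
$$

Probabilistic facts are independent and thus we can factorize the probabilities:

$$
= \sum_{L \subseteq At} p_{n+1} \cdot P_{T_1}(L) \log \frac{p_{n+1} \cdot P_{T_1}(L)}{q_{n+1} \cdot P_{T_2}(L)}
 + \sum_{L \subseteq At} (1 - p_{n+1}) \cdot P_{T_1}(L) \log \frac{(1 - p_{n+1}) \cdot P_{T_1}(L)}{(1 - q_{n+1}) \cdot P_{T_2}(L)}
$$

Using the rules for log and factoring out the constants:

$$
= p_{n+1} \sum_{L \subseteq At} P_{T_1}(L) \left( \log \frac{p_{n+1}}{q_{n+1}} + \log \frac{P_{T_1}(L)}{P_{T_2}(L)} \right)
 + (1 - p_{n+1}) \sum_{L \subseteq At} P_{T_1}(L) \left( \log \frac{1 - p_{n+1}}{1 - q_{n+1}} + \log \frac{P_{T_1}(L)}{P_{T_2}(L)} \right)
$$

Expanding the inner sums and factoring out constants:

$$
\begin{aligned}
= {} & p_{n+1} \left( \log \frac{p_{n+1}}{q_{n+1}} \right) \sum_{L \subseteq At} P_{T_1}(L)
 + p_{n+1} \sum_{L \subseteq At} P_{T_1}(L) \log \frac{P_{T_1}(L)}{P_{T_2}(L)} \\
& + (1 - p_{n+1}) \left( \log \frac{1 - p_{n+1}}{1 - q_{n+1}} \right) \sum_{L \subseteq At} P_{T_1}(L)
 + (1 - p_{n+1}) \sum_{L \subseteq At} P_{T_1}(L) \log \frac{P_{T_1}(L)}{P_{T_2}(L)}
\end{aligned}
$$

Since $\sum_{L \subseteq At} P_{T_1}(L)$ is 1, rearranging yields:

$$
= p_{n+1} \log \frac{p_{n+1}}{q_{n+1}} + (1 - p_{n+1}) \log \frac{1 - p_{n+1}}{1 - q_{n+1}}
 + \sum_{L \subseteq At} P_{T_1}(L) \log \frac{P_{T_1}(L)}{P_{T_2}(L)}
$$

Using the inductive assumption:

$$
= p_{n+1} \log \frac{p_{n+1}}{q_{n+1}} + (1 - p_{n+1}) \log \frac{1 - p_{n+1}}{1 - q_{n+1}}
 + \sum_{i=1}^{n} \left( p_i \log \frac{p_i}{q_i} + (1 - p_i) \log \frac{1 - p_i}{1 - q_i} \right)
$$

Rearranging the terms:

$$
= \sum_{i=1}^{n+1} \left( p_i \log \frac{p_i}{q_i} + (1 - p_i) \log \frac{1 - p_i}{1 - q_i} \right)
$$

□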